The present report is the twenty-second in a series describing 
research in information organization and retrieval conducted by the 
Department of Computer Science at Cornell University. The report, covering 
work carried out over approximately two years (summer 1972 to summer 1974), is 
divided into four parts: indexing theory (sections I to III), automatic 
content analysis (sections IV to VI), feedback searching (sections VII to 
IX), and dynamic file management (sections X to XII). 

The normal schedule in the distribution of ISR reports has not been 
maintained in recent years, due largely to the scarcity of publication funds. 
For the same reason, a number of recently published articles covering related 
research work are not being reprinted in the present report. Interested 
readers may want to refer to the following additional items in particular: 

a) Contributions to the Theory of Indexing (G. Salton, C.S. Yang, 
and C.T. Yu), Proc. IFIP Congress 74, North Holland Publishing 
Company, Amsterdam, 1974. 

b) On the Specification of Term Values in Automatic Indexing 
(G. Salton and C.S. Yang), Journal of Documentation, Vol. 29, 
No. 4, December 1973, p. 351-372. 

c) Proposals for a Dynamic Library (G. Salton), Information - Part 2, 
Vol. 2, No. 3, 1973, p. 5-25. 

d) Theory of Indexing and Classification (C.T. Yu), Doctoral Thesis, 
Cornell University, Technical Report 73-181, Department of 
Computer Science, Cornell University, Ithaca, N.Y., August 1973, 
238 pages. 
Some time has been devoted during the last year to the design of 
an on-line implementation of the experimental SMART retrieval system, and 
test runs of the on-line version have been made on the IBM 370/158 computer 
at Cornell. The off-line version of the system continues to be used for 
experiments at various locations in the United States and abroad. 

In recent years, increasing attention has been paid to the study of 
a variety of file organization and retrieval algorithms, including some that 
have not yet found their way into operational implementation. Among these are 
the use of clustered file manipulations instead of inverted directory searches, 
vector matching processes instead of keyword coincidence counting, dynamic 
document space modification, automatic file retirement procedures, and interactive 
retrieval methodologies. 

The present report thus includes studies dealing with feedback searching 
and dynamic file modification. A great deal of emphasis has also been placed 
on the generation of new indexing theories which assign specific functions in 
content analysis to various indicators such as single terms, phrases, and 
thesaurus categories. These theories are explained in Part 1 of the present 
report. 

Sections I and II by G. Salton, A. Wong, and C.S. Yang, and by A. Wong, 
respectively, cover investigations relating the density of the document space 
to the retrieval effectiveness obtainable with such a space. In particular, 
the earlier work dealing with the determination of term discrimination values 
makes it appear that "good" terms — those indicative of information content — 
are those which increase the dissimilarity between documents, that is, which 
spread out the document space. The experimental output in sections I and II 
confirms that a low-density space is associated with effective retrieval, and 
conversely that a high-density space provides poor retrieval performance. 




The theory in sections I and II is developed further in section 
III by G. Salton, C.S. Yang, and C.T. Yu, relating the discrimination value of 
a term to its document frequency in a collection. The best discriminators 
are terms with medium document frequency. This fact is used to construct 
an optimum indexing vocabulary by turning high frequency single terms into 
phrases, thereby reducing the document frequency, and assembling low frequency 
terms into thesaurus groups, thus increasing the frequency. The effectiveness 
of the resulting indexing vocabulary is assessed by citing appropriate 
experimental evidence. 

Sections IV to VI, constituting Part 2 of this report, deal with 
various aspects of automatic content analysis. Section IV by R. Crawford 
covers the construction and effectiveness of a variety of negative dictionaries 
("stop lists") containing terms that should not be used for content identification. 
This work leads to the generation of an indexing vocabulary of optimum size. 
Section V by A. v.d. Meulen covers the operations of the so-called dynamic 
information values. In that system all term weights are fixed initially at 
some given value (say 1). Good terms, that is, those contained in useful 
documents, are then increased in weight dynamically in the course of the 
operations. Bad terms are similarly demoted by reducing the term weights. 
The last section, number VI, by K. Welles deals with experiments leading to 
the construction of optimum term classifications (thesauruses) using the 
pseudo-classification method. This process utilizes a classification criterion 
based on user relevance assessments to group the terms, rather than on the 
more usual semantic term similarities. 
The next three sections, VII to IX, constitute Part 3 of this 
report, entitled feedback searching. Section VII by A. Wong, R. Peck, 
and A. v.d. Meulen attempts to determine relationships between the 
effectiveness of the initial content analysis (indexing) and the 
usefulness of iterative feedback searching. It is found that differences in 
the effectiveness of the initial indexing are preserved during the 
feedback operations. Section VIII by K. Sardana relates the length of the 
feedback query to the effectiveness of the retrieval operation. It is 
found that shorter feedback queries provide better retrieval; methods are 
therefore given for reducing feedback query length. A similar reduction 
in vector length is investigated by M. Kaplan in section IX, applied to the 
centroid vectors (profiles) representing the document groups in a clustered 
file organization. 

Part 4, consisting of sections X to XII, covers dynamic document space 
modification procedures. In section X by C.S. Yang the term discrimination 
values are used as parameters in the construction of appropriate document 
space modification methods. A document "utility value", determined by earlier 
user-system interactions, is similarly used for document space modification 
in section XI by A. Wong and A. v.d. Meulen. Finally, in section XII by 
K. Sardana a variety of automatic document retirement methods are used 
to reduce the size of the collection by eliminating items 
exhibiting low usefulness. Three retirement methods, based respectively on 
average term weight measurements, document space modification methods, 
and the storage of special usage indicators, are examined and their 
effectiveness is evaluated. 






All earlier ISR reports in this series are obtainable from the 
National Technical Information Service in Springfield, Virginia. The 
order numbers for the last few reports are PB 214-020 (ISR-21), 
PB 211-061 (ISR-20), PB 204-946 (ISR-19) and PB 198-069 (ISR-18), 
respectively. 

G. Salton 
A Vector Space Model for Automatic Indexing 
G. Salton, A. Wong, and C.S. Yang† 



Abstract 

In a document retrieval, or other pattern matching environment where 
stored entities (documents) are compared with each other, or with incoming 
patterns (search requests), it appears that the best indexing (property) 
space is one where each entity lies as far away from the others as possible; 
that is, retrieval performance correlates inversely with space density. This 
result is used to choose an optimum indexing vocabulary for a collection of 
documents. Typical evaluation results are shown demonstrating the usefulness 
of the model. 

1. Document Space Configurations 

Consider a document space consisting of documents $D_i$, each identified 
by one or more index terms $T_j$; the terms may be weighted according to their 
importance, or unweighted with weights restricted to 0 and 1.* A typical 
three-dimensional index space is shown in Fig. 1, where each item is identified 
by up to three distinct terms. The three-dimensional example may be extended 
to t dimensions when t different index terms are present. In that case, 
each document $D_i$ is represented by a t-dimensional vector 

$$D_i = (d_{i1}, d_{i2}, \ldots, d_{it}),$$

$d_{ij}$ representing the weight of the jth term. 

†Department of Computer Science, Cornell University, Ithaca, N.Y., 14853 

*Although we speak of documents and index terms, the present development 
applies to any set of entities identified by weighted property vectors. 
Given the index vectors for two documents, it is possible to compute 
a similarity coefficient between them, $s(D_i, D_j)$, reflecting the degree of 
similarity in the corresponding terms and term weights. Such a similarity 
measure might be an inverse function of the angle between the corresponding 
vector pairs; when the term assignment for two vectors is identical, the 
angle will be zero, producing a maximum similarity measure. 
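
One common choice that satisfies this description is the cosine of the angle between the two vectors. The sketch below is only an illustration under that assumption; the text does not fix a particular coefficient, and the function name `cosine_similarity` and the sample vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term-weight vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for a document with no terms
    return dot / (norm_u * norm_v)

# Hypothetical three-term documents.
d1 = [1.0, 2.0, 0.0]
d2 = [2.0, 4.0, 0.0]  # same direction as d1: angle zero, maximum similarity
d3 = [0.0, 0.0, 3.0]  # shares no terms with d1: minimum similarity

same_direction = cosine_similarity(d1, d2)
no_overlap = cosine_similarity(d1, d3)
```

Because the cosine depends only on direction, it agrees with the later remark that documents may be represented by the points where their vectors touch the envelope of the space.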

Instead of representing each document by a complete vector originating 
at the 0-point in the coordinate system, the relative position of the vectors 
is preserved by considering only the envelope of the space. In that case, each 
document is graphically identified by a single point whose position is specified 
by the area where the corresponding document vector touches the envelope of 
the space. Two documents with similar index terms are then represented by 
points that are very close together in the space: obviously the distance 
between two document points in the space is inversely correlated with the 
similarity between the corresponding vectors. 

Since the configuration of the document space is a function of the 
manner in which terms and term weights are assigned to the various documents 
of a collection, one may ask whether an optimum document space configuration 
exists, that is, one which produces an optimum retrieval performance.* 

If nothing special is known about the documents under consideration, 
one might conjecture that an ideal document space is one where documents that 
are jointly relevant to certain user queries are clustered together, thus 
insuring that they would be retrievable jointly in response to the corresponding 
queries. Contrariwise, documents that are never wanted simultaneously would 
appear well separated in the document space. Such a situation is depicted 
in the illustration of Fig. 2, where the distance between two x's representing 
two documents is inversely related to the similarity between the corresponding 
index vectors. 

*Retrieval performance is often measured by parameters such as recall and 
precision, reflecting the ratio of relevant items actually retrieved, and 
of retrieved items actually relevant. The question concerning optimum 
space configurations may then be more conventionally expressed in terms of 
the relationship between document indexing on the one hand, and retrieval 
performance on the other. 

While the document configuration of Fig. 2 may indeed represent the 
best possible situation, assuming that relevant and nonrelevant items with 
respect to the various queries are separable as shown, no practical way exists 
for actually producing such a space, because during the indexing process, it is 
difficult to anticipate what relevance assessments the user population will 
provide over the course of time. That is, the optimum configuration is difficult 
to generate in the absence of a priori knowledge of the complete retrieval 
history for the given collection. 

In these circumstances, one might conjecture that the next best thing 
is to achieve a maximum possible separation between the individual documents 
in the space, as shown in the example of Fig. 3. Specifically, for a collection 
of n documents, one would want to minimize the function 

$$F = \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} s(D_i, D_j), \tag{1}$$

where $s(D_i, D_j)$ is the similarity between documents i and j. Obviously 
when the function of equation (1) is minimized, the average similarity between 
document pairs is smallest, thus guaranteeing that each given document may 
be retrieved when located sufficiently close to a user query without also 
necessarily retrieving its neighbors. This insures a high precision search 
output, since a given relevant item is then retrievable without also retrieving 
a number of nonrelevant items in its vicinity. In cases where several different 
relevant items for a given query are located in the same general area of the 
space, it may then also be possible to retrieve many of the relevant while 
rejecting most of the nonrelevant. This produces both high recall and high 
precision.* 

[Fig. 2. Ideal Document Space (labels: Individual Documents; Groups of Relevant Items)] 
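
The density function of equation (1) can be sketched in a few lines. This is a minimal illustration, assuming a cosine similarity for s and assuming that a document's similarity with itself is left out of the sum; the names `pairwise_density`, `bunched`, and `spread` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity, used here as an assumed choice for s(D_i, D_j)."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def pairwise_density(docs, sim=cosine):
    """F of equation (1): sum of s(D_i, D_j) over all ordered pairs with i != j."""
    n = len(docs)
    return sum(sim(docs[i], docs[j])
               for i in range(n) for j in range(n) if i != j)

# Two hypothetical three-document collections over two index terms.
bunched = [[1.0, 0.1], [1.0, 0.2], [1.0, 0.3]]  # nearly parallel vectors
spread = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # widely separated vectors

f_bunched = pairwise_density(bunched)
f_spread = pairwise_density(spread)
```

The spread collection yields the smaller F, which is the direction of optimization the text proposes. The double loop also makes the cost visible: the number of similarity computations grows with the square of the collection size.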

Two questions then arise: first, is it in fact the case that a 
separated document space leads to a good retrieval performance, and vice versa, 
that improved retrieval performance implies a wider separation of the documents 
in the space; second, is there a practical way of measuring the space separation? 
In practice, the expression of equation (1) is difficult to compute, since the 
number of vector comparisons is proportional to $n^2$ for a collection of n 
documents. 

For this reason, a clustered document space is best considered, where 
the documents are grouped into classes, each class being represented by a 
class centroid. A typical clustered document space is shown in Fig. 4, where 
the various document groups are represented by circles and the centroids by 
black dots located more or less at the center of the respective clusters.† 
For a given document class K comprising m documents, each element of the 
centroid C may then be defined as the average weight of the same elements 
in the corresponding document vectors, that is 

$$c_j = \frac{1}{m} \sum_{D_i \in K} d_{ij}. \tag{2}$$
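
The centroid definition of equation (2) amounts to an element-wise average. A minimal sketch follows; the function name `centroid` and the sample cluster are hypothetical.

```python
def centroid(cluster):
    """Element-wise average of the document vectors in one cluster (equation (2))."""
    m = len(cluster)
    if m == 0:
        raise ValueError("a cluster must contain at least one document")
    t = len(cluster[0])
    return [sum(doc[j] for doc in cluster) / m for j in range(t)]

# Hypothetical class K with m = 3 documents over t = 3 index terms.
cluster_k = [
    [1.0, 0.0, 2.0],
    [3.0, 0.0, 0.0],
    [2.0, 3.0, 1.0],
]
c = centroid(cluster_k)  # element j is the mean weight of term j in the class
```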

*In practice, the best performance is achieved by obtaining for each user 
a desired recall level (a specified proportion of the relevant items); at 
that recall level, one then wants to maximize precision by retrieving as 
few of the nonrelevant as possible. 

†A number of well-known clustering methods exist for automatically generating 
a clustered collection from the term vectors representing the individual 
documents. [1] 
Corresponding to the centroid of each individual document cluster, a 
centroid may be defined for the whole document space. This main centroid, 
represented by a small rectangle in the center of Fig. 4, may then be obtained 
from the individual cluster centroids in the same manner as the cluster centroids 
are computed from the individual documents. That is, the main centroid of the 
complete space is simply the average of the various cluster centroids. 

In a clustered document space, the space density measure consisting 
of the sum of all pairwise document similarities, introduced earlier as 
equation (1), may be replaced by the sum of all similarity coefficients 
between each document and the main centroid, that is 

$$Q = \sum_{i=1}^{n} s(C^*, D_i), \tag{3}$$

where $C^*$ denotes the main centroid. Whereas the computation of equation (1) 
requires $n^2$ operations, an evaluation of equation (3) is proportional to n. 
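
The cheaper density measure of equation (3) may be sketched as follows. The sketch assumes a cosine similarity and non-overlapping clusters (so that each document is counted once); `mean_vector` and `centroid_density` are hypothetical names.

```python
import math

def cosine(u, v):
    """Cosine similarity, an assumed choice for the coefficient s."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def mean_vector(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    m = len(vectors)
    return [sum(column) / m for column in zip(*vectors)]

def centroid_density(clusters):
    """Q of equation (3): sum of s(C*, D_i) over all documents.

    C* is taken, as in the text, to be the average of the cluster centroids.
    """
    centroids = [mean_vector(cluster) for cluster in clusters]
    main_centroid = mean_vector(centroids)
    return sum(cosine(main_centroid, doc)
               for cluster in clusters for doc in cluster)

# Hypothetical space: two clusters of two documents each.
clusters = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
q = centroid_density(clusters)
```

Each document is compared once with the main centroid, so the work grows linearly with the number of documents rather than quadratically.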

Given a clustered document space such as the one shown in Fig. 4, it is 
necessary to decide what type of clustering represents most closely the 
separated space shown for the unclustered case in Fig. 3. If one assumes that 
documents that are closely related within a single cluster normally exhibit 
identical relevance characteristics with respect to most user queries, then 
the best retrieval performance should be obtainable with a clustered space 
exhibiting tight individual clusters, but large intercluster distances; 
that is, 

a) the average similarity between pairs of documents within a single 
cluster should be maximized, while simultaneously 

b) the average similarity between different cluster centroids is 
minimized. 
The reverse obtains for cluster organizations not conducive to good 
performance, where the individual clusters should be loosely defined, 
whereas the distance between different cluster centroids should be small. 

In the remainder of this study, actual performance figures are 
given relating document space density to retrieval performance, and 
conclusions are reached regarding good models for automatic indexing. 

2. Correlation between Indexing Performance and Space Density 

The main techniques useful for the evaluation of automatic indexing 
methods are now well understood. In general, a simple straightforward 
process can be used as a base-line criterion — for example, the use of 
certain word stems extracted from documents or document abstracts, weighted 
in accordance with the frequency of occurrence ($f_i^k$) of each term k in 
document i. This method is known as term-frequency weighting. 
Recall-precision graphs can be used to compare the performance of this standard 
process against the output produced by more refined indexing methods. 
Typically, a recall-precision graph is a plot giving precision figures, 
averaged over a number of user queries, at ten fixed recall levels, ranging 
from 0.1 to 1.0 in steps of 0.1. The better indexing method will of course 
produce higher precision figures at equivalent recall levels. 

One of the best automatic term weighting procedures evaluated as 
part of a recent study consisted of multiplying the standard term frequency 
weight $f_i^k$ by a factor inversely related to the document frequency $d_k$ 
of the term (the number of documents in the collection to which the term is 
assigned). [2] Specifically, if $d_k$ is the document frequency of term k, 
the inverse document frequency $(IDF)_k$ of term k may be defined as [3]: 

$$(IDF)_k = \lceil \log_2 n \rceil - \lceil \log_2 d_k \rceil + 1.$$

A term weighting system proportional to $f_i^k \cdot (IDF)_k$ will assign the largest 
weight to those terms which arise with high frequency in individual documents, 
but are at the same time relatively rare in the collection as a whole. 
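
As a rough illustration of this weighting, the sketch below takes the inverse document frequency to be ceil(log2 n) - ceil(log2 d_k) + 1, with n the collection size, and multiplies it by the within-document term frequency. The function names, the sample terms, and their frequencies are hypothetical; only the collection size of 424 comes from the text.

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency: ceil(log2 n) - ceil(log2 d_k) + 1."""
    return math.ceil(math.log2(n_docs)) - math.ceil(math.log2(doc_freq)) + 1

def tf_idf_weights(term_freqs, doc_freqs, n_docs):
    """Weight each term of one document by term frequency times IDF."""
    return {term: tf * idf(n_docs, doc_freqs[term])
            for term, tf in term_freqs.items()}

n = 424                                      # collection size used in the text
term_freqs = {"wing": 3, "flow": 3}          # hypothetical counts in one document
doc_freqs = {"wing": 5, "flow": 300}         # hypothetical document frequencies

weights = tf_idf_weights(term_freqs, doc_freqs, n)
```

Both sample terms occur three times in the document, but the term assigned to few documents receives the larger weight, which is the behavior the paragraph describes.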

It was found in the earlier study that the average improvement in recall 
and precision (average precision improvement at the ten fixed recall points) 
was about 14 percent for the system using inverse document frequencies over 
the standard term frequency weighting. The corresponding space density 
measurements are shown in Table 1 using two different cluster organizations 
for a collection of 424 documents in aerodynamics: 

a) Cluster organization A is based on a large number of relatively 
small clusters, and a considerable amount of overlap between the 
clusters (each document appears in about two clusters on the 
average); the clusters are defined from the document-query relevance 
assessments by placing into a common class all documents jointly 
declared relevant to a given user query. 

b) Cluster organization B exhibits fewer classes (83 versus 155) 
of somewhat larger size (6.6 documents per class on the average 
versus 5.8 for cluster organization A); there is also much less 
overlap among the clusters (1.3 clusters per document versus 2.1). 
The classes are constructed by using a fast automatic tree-search 
algorithm due to Williamson. [4] 

A number of space density measures are shown in Table 1 for the two 
cluster organizations, including the average similarity between the documents 
and the corresponding cluster centroids (factor x); the average similarity 
between the cluster centroids and the main centroid; and the average similarity 
between pairs of cluster centroids (factor y). Since a well-separated space 
corresponds to tight clusters (large x) and large differences between different 
clusters (small y), the ratio y/x can be used to measure the overall space 
density. [5] 
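
The ratio y/x can be computed directly from a clustered collection. The following sketch assumes a cosine similarity and at least two clusters; the function name `space_density_ratio` and the two sample spaces are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity, an assumed choice for the similarity coefficient."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def mean_vector(vectors):
    m = len(vectors)
    return [sum(column) / m for column in zip(*vectors)]

def space_density_ratio(clusters):
    """Return y/x, where
    x = average similarity between documents and their own cluster centroid,
    y = average similarity between pairs of distinct cluster centroids."""
    if len(clusters) < 2:
        raise ValueError("at least two clusters are needed to compute y")
    centroids = [mean_vector(cluster) for cluster in clusters]
    doc_sims = [cosine(centroids[i], doc)
                for i, cluster in enumerate(clusters) for doc in cluster]
    pair_sims = [cosine(centroids[i], centroids[j])
                 for i in range(len(centroids))
                 for j in range(i + 1, len(centroids))]
    x = sum(doc_sims) / len(doc_sims)
    y = sum(pair_sims) / len(pair_sims)
    return y / x

# Hypothetical spaces: tight, distant clusters versus clusters that nearly coincide.
separated = [[[1.0, 0.0], [1.0, 0.1]], [[0.0, 1.0], [0.1, 1.0]]]
bunched = [[[1.0, 0.8], [1.0, 1.0]], [[0.8, 1.0], [1.0, 1.2]]]

ratio_separated = space_density_ratio(separated)
ratio_bunched = space_density_ratio(bunched)
```

A smaller ratio indicates a less dense, better separated space in the sense used for Tables 1 to 4.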

It may be seen from Table 1 that all density measures are smaller for 
the indexing system based on inverse document frequencies; that is, the 
documents within individual clusters resemble each other less, and so do the 
complete clusters themselves. However, the "spreading out" of the clusters 
is greater than the spread of the documents inside each cluster. This accounts 
for the overall decrease in space density between the two indexing systems. 
The results of Table 1 would seem to support the notion that improved 
recall-precision performance is associated with decreased density in the 
document space. 

The reverse proposition, that is, whether decreased performance implies 
increased space density may be tested by carrying out term weighting operations 
inverse to the ones previously used. Specifically, since a weighting system 
in inverse document frequency order produces a high recall-precision performance, 
a system which weights the terms directly in order of their document frequencies 
(terms occurring in a large number of documents receive the highest weights) 
should be correspondingly poor. In the output of Table 2, a term weighting 
system proportional to $f_i^k \cdot DF_k$ is used, where $f_i^k$ is again the term 
frequency of term k in document i, and $DF_k$ is defined as $10/(IDF)_k$. The 
recall-precision figures of Table 2 show that such a weighting system produces 
a decreased performance of about ten percent, compared with the standard. 

The space density measurements included in Table 2 are the same as 
those in Table 1. For the indexing system of Table 2, a general "bunching 
up" of the space is noticeable, both inside the clusters and between clusters. 
However, the similarity of the various cluster centroids increases more than that 
between documents inside the clusters. This accounts for the y/x factor 
being higher by 16 and 7 percent for the two cluster organizations, respectively. 

3. Correlation between Space Density and Indexing Performance 

In the previous section it was shown that certain indexing methods which 
operate effectively in a retrieval environment are associated with a decreased 
density of the vectors in the document space, and contrariwise that poor 
retrieval performance corresponds to a space that is more compressed. 

The relation between space configuration and retrieval performance may, 
however, also be considered from the opposite viewpoint. Instead of picking 
document analysis and indexing systems with known performance characteristics 
and testing their effect on the density of the document space, it is possible 
artificially to change the document space configurations in order to ascertain 
whether the expected changes in recall and precision are in fact produced. 

The space density criteria previously given stated that a collection of 
small tightly clustered documents with wide separation between individual 
clusters should produce the best performance. The reverse is true of large 
nonhomogeneous clusters that are not well separated. To achieve improvements 
in performance, it would then seem to be sufficient to increase the similarity 
between document vectors located in the same cluster, while decreasing the 
similarity between different clusters or cluster centroids. The first effect 
is achieved by emphasizing the terms that are unique to only a few clusters, 
or terms whose cluster occurrence frequencies are highly skewed (that is, they 
occur with large occurrence frequencies in some clusters, and with much lower 
frequencies in many others). The second result is produced by deemphasizing 
terms that occur in many different clusters. 
Two parameters may be introduced to be used in carrying out the 
required transformations [5]: 

NC(k)         the number of clusters in which term k occurs (a 
              term occurs in a cluster if it is assigned to at 
              least one document in that cluster); 

and CF(k,j)   the cluster frequency of term k in cluster j, 
              that is, the number of documents in cluster j in 
              which term k occurs. 

For a collection arranged into p clusters, the average cluster frequency 
$\overline{CF}(k)$ may then be defined from CF(k,j) as 

$$\overline{CF}(k) = \frac{1}{p} \sum_{j=1}^{p} CF(k,j).$$

Given the above parameters, the skewness of the occurrence frequencies 
of the terms may now be measured by a factor such as 

$$F_1 = |\overline{CF}(k) - CF(k,j)|.$$

On the other hand, a factor $F_2$ inverse to NC(k) (for example, 1/NC(k)) 
can be used to reflect the rarity with which term k is assigned to the 
various clusters. By multiplying the weight of each term k in each 
cluster j by a factor proportional to $F_1 \cdot F_2$, a suitable spreading out 
should be obtained in the document space. Contrariwise, the space will be 
compressed when a multiplicative factor proportional to $1/(F_1 \cdot F_2)$ is used. 
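
A minimal sketch of the two factors follows, taking the skewness factor as the absolute difference between the average cluster frequency and CF(k,j), and the rarity factor as 1/NC(k). Documents are represented as sets of terms; the function names and the sample clusters are hypothetical.

```python
def cluster_frequencies(clusters, term):
    """CF(k, j) for every cluster j: number of documents in j containing the term."""
    return [sum(1 for doc in cluster if term in doc) for cluster in clusters]

def spreading_factor(clusters, term, j):
    """Product of the skewness factor and the rarity factor for a term in cluster j."""
    cf = cluster_frequencies(clusters, term)
    nc = sum(1 for count in cf if count > 0)  # NC(k)
    if nc == 0:
        return 0.0                            # term occurs nowhere in the collection
    mean_cf = sum(cf) / len(cf)               # average cluster frequency over p clusters
    skewness = abs(mean_cf - cf[j])           # first factor
    rarity = 1.0 / nc                         # second factor, 1/NC(k)
    return skewness * rarity

# Hypothetical collection of p = 3 clusters, two documents per cluster.
clusters = [
    [{"wing", "flow"}, {"wing", "flow"}],
    [{"flow", "heat"}, {"flow"}],
    [{"flow", "shock"}, {"flow"}],
]

wing_factor = spreading_factor(clusters, "wing", 0)  # confined to one cluster
flow_factor = spreading_factor(clusters, "flow", 0)  # evenly spread over all clusters
```

A term confined to one cluster receives a large factor, whereas a term spread evenly over all clusters receives a factor of zero. The compression transform divides by this product, so a practical implementation would have to decide how to treat a zero value; the text does not say how this was handled.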

The output of Table 3 shows that a modification of term weights by 
the $F_1 \cdot F_2$ factor produces precisely the anticipated effect: the similarity 
between documents included in the same cluster (factor x) is now greater, 
whereas the similarity between different cluster centroids (factor y) has 
decreased. Overall, the space density measure (y/x) decreases by 18 and 
11 percent respectively for the two cluster organizations. The average 
retrieval performance for the spread-out space, shown at the bottom of 
Table 3, is improved by a few percentage points. 

The corresponding results for the compression of the space using 
a transformation factor of $1/(F_1 \cdot F_2)$ are shown in Table 4. Here the 
similarity between documents inside a cluster decreases, whereas the 
similarity between cluster centroids increases. The overall space 
density measure (y/x) increases by 11 and 16 percent for the two cluster 
organizations compared with the space representing the standard term 
frequency weighting. This dense document space produces losses in recall 
and precision performance of 12 to 13 percent. 

Taken together, the results of Tables 1 to 4 indicate that retrieval 
performance and document space density appear inversely related, in the 
sense that effective (questionable) indexing methods in terms of recall 
and precision are associated with separated (compressed) document spaces; 
on the other hand, artificially generated alterations in the space densities 
appear to produce the anticipated changes in performance. 

The foregoing evidence thus confirms the usefulness of the "term 
discrimination" model and of the automatic indexing theory based on it. 
These questions are examined briefly in the remainder of this study. 



4. The Discrimination Value Model 

For some years, a document indexing model known as the term 
discrimination model has been used experimentally. [2,6] This model bases 
the value of an index term on its "discrimination value" DV, that is, on 
an index which measures the extent to which a given term is able to 
increase the differences among document vectors when assigned as an index 
term to a given collection of documents. A "good" index term, one 
with a high discrimination value, decreases the similarity between 
documents when assigned to the collection, as shown in the example of 
Fig. 5. The reverse obtains for the "bad" index term with a low 
discrimination value. 

To measure the discrimination value of a term, it is sufficient 
to take the difference in the space densities before and after assignment 
of the particular term. Specifically, let the density of the complete 
space be measured by a function Q such as that of equation (3); that is, 
by the sum of the similarities between all documents and the space 
centroid. The contribution of a given term k to the space density may be 
ascertained by computing the function 

$$DV_k = Q_k - Q,$$

where $Q_k$ is the compactness of the document space with term k deleted 
from all document vectors. If term k is a good discriminator, valuable 
for content identification, then $Q_k > Q$, that is, the document space after 
removal of term k will be more compact (because upon assignment of that 
term to the documents of a collection the documents will resemble each other 
less and the space spreads out). Thus for good discriminators $Q_k - Q > 0$; 
the reverse obtains for poor discriminators, for which $Q_k - Q < 0$. 
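
The discrimination value can be sketched as the change in density when one term is deleted from every vector. The sketch assumes a cosine similarity and an unclustered space whose centroid is the average of all document vectors; the function names and the four sample documents are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity, an assumed choice for the similarity coefficient."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def density(docs):
    """Q: sum of similarities between every document and the space centroid."""
    m = len(docs)
    space_centroid = [sum(column) / m for column in zip(*docs)]
    return sum(cosine(space_centroid, doc) for doc in docs)

def discrimination_value(docs, k):
    """Density with term k deleted from all vectors, minus the full-space density."""
    without_k = [[w for j, w in enumerate(doc) if j != k] for doc in docs]
    return density(without_k) - density(docs)

# Hypothetical collection over three terms. Term 0 is assigned to every
# document; terms 1 and 2 each occur in half of the documents.
docs = [
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
]

dv_common = discrimination_value(docs, 0)     # negative: a poor discriminator
dv_selective = discrimination_value(docs, 1)  # positive: a good discriminator
```

Deleting the term shared by all documents spreads the space out, so its value is negative; deleting a term that separates the two halves of the collection compresses the space, so its value is positive.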

Because of the manner in which the discrimination values are defined, 
it is clear that the good discriminators must be those with uneven occurrence 
frequency distributions which cause the space to spread out when assigned by 
decreasing the similarity between the individual documents. The reverse is 
true for the bad discriminators. A typical list including the ten best 
terms and the ten worst terms in discrimination value order (in order by the 
$Q_k - Q$ value) is shown in Table 5 for a collection of 425 articles in world 
affairs from Time magazine. A total of 7569 terms are used for this 
collection, exclusive of the common English function words that have been deleted. 

In order to translate the discrimination value model into a possible 
theory of indexing, it is necessary to examine the properties of good and 
bad discriminators in greater detail. Fig. 6 is a graph of the terms assigned 
to a sample collection of 450 documents in medicine, presented in order by 
their document frequencies. For each class of terms (those of document 
frequency 1, document frequency 2, etc.) the average rank of the 
corresponding terms is given in discrimination value order (rank 1 is assigned 
to the best discriminator and rank 4726 to the worst term for the 4726 terms 
of the medical collection). 

Fig. 6 shows that terms of low document frequency (those that occur 
in only one, or two, or three documents) have rather poor average 
discrimination ranks. The several thousand terms of document frequency 1 have an 
average rank exceeding 3000 out of 4726 in discrimination value order. The 
terms with very high document frequency (at least one term in the medical 
collection occurs in as many as 138 documents out of 450) are even worse 
discriminators; the terms with document frequency greater than 25 have average 
discrimination ranks in excess of 4000 in the medical collection. The best 
discriminators are those whose document frequency is neither too low nor too 
high. 

The situation relating document frequency to term discrimination value 
is summarized in Fig. 7. The 4 percent of the terms with the highest document 
frequency, representing about 50 percent of the total term assignments to the 
documents of a collection, are the worst discriminators. The 70 percent of 
the terms with the lowest document frequency are generally poor 
discriminators. The best discriminators are the 25 percent whose document 
frequency lies approximately between n/100 and n/10 for n documents. 

If the model of Fig. 7 is a correct representation of the situation 
relating to term importance, the following indexing strategy results [6,7]: 

a) Terms with medium document frequency should be used for content 
identification directly, without further transformation. 

b) Terms with very high document frequency should be moved to the 
left on the document frequency spectrum by transforming them 
into entities of lower frequency; the best way of doing this 
is by taking high-frequency terms and using them as components 
of indexing phrases — a phrase such as "programming language" 
will necessarily exhibit lower document frequency than either 
"program", or "language" alone. 

c) Terms with very low document frequency should be moved to the 
right on the document frequency spectrum by being transformed 
into entities of higher frequency; one way of doing this is by 
collecting several low frequency terms that appear semantically 
similar and including them in a common term (thesaurus) class. 
Each thesaurus class necessarily exhibits a higher document 
frequency than any of the component members that it replaces. 
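
The direction of the two transformations can be checked on a toy collection. The sketch below is only an illustration under simplifying assumptions: a phrase is counted as present when both of its components occur in the same document (adjacency is ignored), a thesaurus class is present when any member occurs, and the sample documents and the helper names are invented.

```python
def term_df(docs, term):
    """Document frequency of a single term."""
    return sum(1 for d in docs if term in d)


def phrase_df(docs, components):
    """Document frequency of a phrase: all components must co-occur."""
    return sum(1 for d in docs if all(c in d for c in components))


def class_df(docs, members):
    """Document frequency of a thesaurus class: any member suffices."""
    return sum(1 for d in docs if any(m in d for m in members))


# Invented sample documents, each represented as a set of terms.
docs = [
    {"program", "language", "compiler"},
    {"program", "language"},
    {"program", "memory"},
    {"language", "grammar"},
    {"parser"},
    {"lexer"},
]

phrase = ("program", "language")
thesaurus_class = ("parser", "lexer", "grammar")

print("phrase df:", phrase_df(docs, phrase))          # moves to the left
print("class df:", class_df(docs, thesaurus_class))   # moves to the right
```

In this example the phrase occurs in fewer documents than either of its components, and the class occurs in more documents than any single member, which is the movement along the document frequency spectrum that rules b) and c) call for.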

The indexing theory, which consists in using certain elements extracted 
from document texts directly as index terms, combined with phrases made up 
of high frequency components and thesaurus classes defined from low frequency 
elements, has been tested using document collections in aerodynamics (CRAN), 
medicine (MED), and world affairs (TIME). [2,6,7] A typical recall-precision 
plot showing the effect of the right-to-left phrase transformation is shown in 
Fig. 8 for the Medlars collection of 450 medical documents. When recall is 
plotted against precision, the curve closest to the upper right-hand corner 
of the graph (where both recall and precision are close to 1) reflects 
the best performance. It may be seen from Fig. 8 that the replacement 
of the high-frequency nondiscriminators by lower frequency phrases 
improves the retrieval performance by an average of 39 percent (the 
precision values at the ten fixed recall points are greater by an average 
of 39 percent). 

The performance of the right-to-left (phrase) transformation and 
left-to-right (thesaurus) transformation is summarized in Table 6 for the 
three previously mentioned test collections. The precision values obtainable 
are near 90 percent for low recall, between 40 and 70 percent 
for medium recall, and between 15 and 45 percent at the high recall end 
of the performance spectrum. The overall improvement obtainable by 
phrase and thesaurus class assignments over the standard term frequency 
process using only the unmodified single terms ranges from 17 percent 
for the world affairs collection to 50 percent for the medical collection. 

A conclusive proof relating the space density analysis and the 
resulting document frequency indexing model to optimality in the retrieval 
performance cannot be furnished. However, the model appears to perform 
well for collections in several different subject areas, and the 
performance results produced by applying the theory have not in the authors' 
experience been surpassed by any other manual or automatic indexing and 
analysis procedures tried in earlier experiments. The model may then 
lead to the best performance obtainable with ordinary document collections 
operating in actual user environments. 

An Investigation on the Effects of 
Different Indexing Methods on the 
Document Space Configuration 
Anita Wong 

Abstract 

An attempt is made in the present study to gain a better 
understanding of the document space configuration through the use of 
clustered document collections and different indexing methods. 

1. Introduction 

Previous work in automatic indexing and clustering in information 
retrieval has mostly been done with the thought of improving the recall and/or 
precision of the search result. Not too much work has been done to gain 
a fuller understanding of the document space configuration itself, presumably 
because this is not directly related to the improvement of the effectiveness 
of the system. However it is quite likely that the configuration of the 
document space does correlate in some way with the effectiveness of the 
system. 

It is natural for documents that are related to be closely similar 
to each other.* But is this really the case for any indexing method? Or 
is it possible that documents that are related are scattered throughout the 
document space and surrounded by extraneous documents which are more or less 
closely packed in groups? 



* Experimental results were obtained by Jones [1]. 

The problem would be easier to answer if the meaning of closeness 
were better defined. Closeness should bear a different meaning in different 
systems, depending on the way the documents are retrieved. On a book shelf, 
two books are said to be close together if they are physically close together. 
Close is defined in this way because whenever one book is located, the other 
is also found. In automatic retrieval, closeness would be proportional to 
the matching function used for document retrieval. The physical distance 
between the documents is less important in this respect. 

In the SMART system, a document is retrieved by a query if the 
similarity between the document and query is high. If the similarity relation 
is assumed transitive, then documents relevant to the same query should be 
similar to each other. It is therefore inconceivable that any indexing 
method should place the related documents in any way other than "close" 
together. However, merely placing the related documents close together does 
not necessarily guarantee good system performance. The unrelated documents 
should be farther apart than the related ones. In other words, ideally, 
documents should form clusters which do not overlap; documents within a 
cluster are related while those in different clusters are not; and the 
distance between two documents within a cluster is shorter than that between 
two documents belonging to two different clusters. Consequently, a good 
indexing method should index the documents in such a way that the related 
documents are close together and non-related ones further apart. 

Many different indexing methods were tested over the years in the 
SMART system, and recall and precision figures were generated. It is not 
known if the indexing methods that produce good performance do in fact place 
the documents in the way predicted above, or if the changes in configuration 
of the document space can be explained in some other way. It is the aim 
of the present work to elucidate the relationship between the configuration 
of the document space and the performance of the system with different 
indexing methods. 

2. Methodology 

An obvious way to solve the problem is to look at each document, and 
at its relation with the other documents. This requires a computation of the 
correlation of every document with every other document, thus obtaining for 
each indexing method a full document-document matrix. These matrices are 
then compared row-wise in order to ascertain the relative changes of 
the documents with respect to each other. This method is not employed 
here because the number of documents involved is usually large. Instead, 
the documents are grouped, and each group is treated as an entity with 
respect to the documents not in the same group. Judging the movements of 
groups of documents instead of individual items alone may be justified because 
it is rather pointless to look at the motion of each document with respect 
to every other, since there are so many documents that individual effects 
would be hardly noticeable. 

The discrimination value model [2] has been found to give improvements 
in search performance. However, the discrimination values are determined by 
lowering the sum of the correlations between the documents and their 
centroid; in other words, the discrimination value is a function of the 
distance between document vectors. The application of the discrimination 
values would tend to increase the average distance between documents, 
regardless of the relations between the documents. But it may be unreasonable 
to have related documents farther apart. Consequently, it is important to 
relate the average increase in distance between similar documents to the 
average increase in distance between all documents. 

Since the investigation is performed on clustered document 
collections, the clustering methods to be used are discussed first. 

Previous experiments were done with clustering methods that 
produce clusters of different sizes with a large variance.* It was 
found that a large cluster is represented by a longer centroid vector. 
In some cases, the centroid vectors were so long that they each included 
75% of the terms occurring in the collection. The correlations between 
these centroids, being high, have become less meaningful in distinguishing 
one centroid from another. The problem could be overcome by using a 
different method in forming the centroid vectors as in Murray and 
Kerchner [3], [4], or by deleting some common terms. But the centroids 
would then lack some of the terms occurring in the documents and thus the 
centroids would not be the true centers of the document clusters. For this 
reason clustering algorithms that produce smaller clusters are considered. 
The clustering methods to be used are: 

A. For each query, one cluster is constructed. The cluster contains 
documents that are relevant to that query. This method is 
chosen because the clusters are easy to obtain in an experimental 
environment and also because it produces small clusters as shown 
in Table 1. This method will be referred to as method RCL. 



B. Williamson's Clustering Algorithm. This method is essentially a 
tree building procedure, which has a bound on the number of sons 
each node may have. This method is used because it also produces 
small clusters. This method will be referred to as method 
SKIP. [5] 

* The centroid vector of a cluster is formed by summing the normalized 
document vectors. Let $D_i$, $i = 1, \ldots, m$, be the documents belonging 
to cluster c; then the centroid vector for c is 
$\sum_{i=1}^{m} D_i / \lVert D_i \rVert$. 



Clustering Method                                  RCL     SKIP

Number of clusters                                 155       83
Average number of documents in one cluster         5.8      6.6
Number of documents in the largest cluster          22        o
Number of documents in the smallest cluster          3        4
Sum of the number of documents in all clusters     900      547

          Statistics for the two Clustering Methods 
                          Table 1 

3. Cluster Measurements 

A number of measurements are performed on the clusters. 
Notation: 

    C   = the main centroid of the entire collection 
    C_i = the ith cluster centroid 
    R_j = the jth document 
    D_j = the jth cluster 
    T_j = the number of documents in D_j 
    N   = the number of clusters. 

The measurements are: 

1. The average correlation of the documents in cluster D_i with 
   their centroid, 

   $$A_i = \frac{1}{T_i} \sum_{R_j \in D_i} \mathrm{cosine}(R_j, C_i)$$

   The average for all the clusters 

   $$a = \sum_i A_i / N$$

2. The correlation between cluster centroid C_i and the main 
   centroid C, 

   $$B_i = \mathrm{cosine}(C_i, C)$$

   and the average of the B_i's 

   $$b = \sum_i B_i / N$$

3. The correlation between two cluster centroids 

   $$C_{ij} = \mathrm{cosine}(C_i, C_j)$$

   and the average of the C_ij's 

   $$c = \sum_i \sum_j C_{ij} / N^2$$

4. The ratio of the average correlation of the documents with their 
   centroid to the average correlation between cluster centroids and 
   the main centroid, 

   $$Q_1 = a/b$$

5. The ratio of the average correlation of the documents with their 
   centroid to the average correlation between cluster centroids, 

   $$Q_2 = a/c$$
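
The five measurements can be sketched in a few lines. The code below is an illustration, not the program used in the experiment. It assumes that documents are equal-length lists of term weights, that centroids are formed by summing length-normalized vectors as in the footnote of Section 2, that the main centroid is built from every document of every cluster, and that the double sum for c includes the pairs with i = j, as the N^2 denominator suggests. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine correlation between two vectors (0.0 if either is null)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def centroid(vectors):
    """Sum of the length-normalized vectors."""
    total = [0.0] * len(vectors[0])
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        if norm:
            for k, x in enumerate(v):
                total[k] += x / norm
    return total


def cluster_measures(clusters):
    """Return a, b, c, Q1 and Q2 for a list of clusters of document vectors."""
    cents = [centroid(docs) for docs in clusters]
    # Assumption: the main centroid is built from all clustered documents.
    main = centroid([d for docs in clusters for d in docs])
    n = len(clusters)
    a = sum(
        sum(cosine(d, ci) for d in docs) / len(docs)
        for docs, ci in zip(clusters, cents)
    ) / n
    b = sum(cosine(ci, main) for ci in cents) / n
    # Assumption: pairs with i == j are included, matching the N^2 divisor.
    c_avg = sum(cosine(ci, cj) for ci in cents for cj in cents) / n ** 2
    return {"a": a, "b": b, "c": c_avg, "Q1": a / b, "Q2": a / c_avg}


if __name__ == "__main__":
    clusters = [
        [[1.0, 0.0, 0.0], [1.0, 0.1, 0.0]],
        [[0.0, 1.0, 0.0], [0.0, 1.0, 0.1]],
    ]
    for name, value in cluster_measures(clusters).items():
        print(name, round(value, 3))
```

Tight, well-separated clusters give a high value of a and low values of b and c, so both Q_1 and Q_2 grow as the clusters become smaller relative to the space.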



4. The Experiment 

The experiment was performed using the Cranfield Thesaurus 
Collection. The collection was first clustered using Williamson's Algorithm 
SKIP, and then using the query relevant method, RCL. The Q values for 
these two clustered document collections were calculated for the different 
indexing methods listed below. 

1. Cranfield 424 Thesaurus Collection. 

2. Cranfield 424 Thesaurus Collection with the application of 
discrimination values. The concept weights of the document 
vectors were multiplied by the discrimination values rescaled 
to an effective range. 

3. Cranfield 424 Thesaurus Collection with the application of 
inverse document frequency. The concept weights of the document 
vectors were modified by multiplying by a function inversely 
related to the document frequency (DF) of the term, i.e. 
new concept weight = old concept weight x [log (N/DF) + 1], 
where N here denotes the number of documents in the collection. 

This model emphasizes the low frequency terms, and deemphasizes 
the high frequency terms. 

4. An indexing method that does not perform as well as the control 
method (method 1). To create the collection, the Cranfield 424 
Thesaurus collection is modified by deleting one hundred terms 
which have discrimination values higher than 0.G4. By deleting 
the high discrimination value terms, it is evident that the 
document vectors will be moved towards each other. However, 
the essential changes of the orientation of the document vectors 
have yet to be determined. This method is referred to as HDVD. 

5. An indexing method created for the purpose of this work. It is 
believed that a good indexing method would place the documents 
into natural clusters with the inter-cluster distance relatively 
large. To achieve this result, the documents were modified so as 
to diminish the distance between documents within each cluster, 
while increasing the length of the inter-cluster 
distances. To diminish the distance between documents 
within a cluster, the terms to be emphasized must be 
those that are unique to a few clusters; that is, terms 
that have low cluster frequency should be emphasized to 
increase the correlation between documents within those 
few clusters. To decrease the correlation between clusters, 
terms that occur in relatively more clusters are deemphasized. 
Using these two criteria, a value, val(t), is determined 
for each concept t. The document collection is modified by 
multiplying the original concept weights by this value. The 
actual procedure is as follows: 






1) For each cluster the document frequency of each 
term t in the cluster is found. This is denoted by 
CLUSFREQ(t,j) for term t and cluster j. 

2) For each term the number of different clusters in 
which it occurs (i.e. the cluster frequency of each 
term) is found. It is denoted by NCLUS(t). 

3) The value, val(t), for term t is determined by 
the equation: val(t) = TMULT(t) x DAC(t), where 
TMULT(t) is a step function which decreases as the 
cluster frequency of term t increases and 
DAC(t) is a function proportional to the skewness of 
a term with respect to the clusters. They are defined 
by the following equations: 

Given that STEP and LOWLIM are some integers, 

$$
\mathrm{TMULT}(t) =
\begin{cases}
2    & \text{if } 1 \le \mathrm{NCLUS}(t) \le \mathrm{STEP} \\
1.75 & \text{if } \mathrm{STEP} < \mathrm{NCLUS}(t) \le 2 \times \mathrm{STEP} \\
1.50 & \text{if } 2 \times \mathrm{STEP} < \mathrm{NCLUS}(t) \le 3 \times \mathrm{STEP} \\
1.25 & \text{if } 3 \times \mathrm{STEP} < \mathrm{NCLUS}(t) \le 4 \times \mathrm{STEP} \\
1.00 & \text{if } 4 \times \mathrm{STEP} < \mathrm{NCLUS}(t) \le \mathrm{LOWLIM} \\
.50  & \text{if } \mathrm{LOWLIM} < \mathrm{NCLUS}(t)
\end{cases}
$$

$$
\mathrm{AVECLUS}(t) = \Bigl( \sum_{j \in \text{cluster}} \mathrm{CLUSFREQ}(t,j) \Bigr) / \mathrm{NCLUS}(t)
$$

$$
\mathrm{DAC}(t) = \Bigl\{ \sum_{j \in \text{cluster}} \bigl[\, |\mathrm{AVECLUS}(t) - \mathrm{CLUSFREQ}(t,j)| + 1 \,\bigr] \Bigr\} / \mathrm{NCLUS}(t)
$$



AVECLUS(t) is the average document frequency for term t 
and DAC(t) is the average of the sum of the deviation 
of document frequency in the cluster from the average. 
Examples: 

1. A term t occurs in 3 clusters, once in each: 

   CLUSFREQ(t,i) = 1,  i = 1,2,3 

   AVECLUS(t) = 3/3 = 1 
   DAC(t) = (1 + 1 + 1)/3 = 1 

2. A term t occurs in 3 clusters: 

   CLUSFREQ(t,1) = 3 
   CLUSFREQ(t,2) = 1 
   CLUSFREQ(t,3) = 1 

   AVECLUS(t) = (3 + 1 + 1)/3 = 1.7 
   DAC(t) = {(|1.7 - 3| + 1) + (|1.7 - 1| + 1) + (|1.7 - 1| + 1)}/3 
          = (2.3 + 1.7 + 1.7)/3 = 1.9 

The collection thus modified will be referred to as 
MOD(STEP,LOWLIM). 

The values 5 and 50 were used for STEP and LOWLIM. 
They were chosen to make TMULT > 1 for approximately 
half of the terms in the collection clustered using "SKIP". 
With the apparent better result obtained from the 
Relevant clustered collection, STEP and LOWLIM were 
changed to 3 and 38 respectively, for the cluster collection 
using SKIP, so that the number of terms with the same 
TMULT value is approximately the same for SKIP MOD(3,38) 
and RCL MOD(5,50). 

6. For the sake of completeness, another indexing method, the inverse 
of MOD(5,50), was tried in order to move the documents within a 
cluster further away from each other and the clusters closer 
to each other. 

1. MODI(1) 

Similar to the MOD method, the mod inverse collection is 
obtained by multiplying the original concept weights 
of the document vectors by a step function IVAL(t), 
defined as: 

IVAL(t) = ITMULT(t) 

$$
\mathrm{ITMULT}(t) =
\begin{cases}
.5  & \text{if } 1 \le \mathrm{NCLUS}(t) \le 20 \\
1.0 & \text{if } 20 < \mathrm{NCLUS}(t) \le 50 \\
2.0 & \text{if } 50 < \mathrm{NCLUS}(t)
\end{cases}
$$

2. MODI(2) 

In MODI(2) the skewness factor, DAC(t), in MOD is also 
taken into consideration. IVAL(t) is defined as: 

IVAL(t) = ITMULT(t) x (4/DAC(t)) 

DAC(t) ranges from 1 to 4, thus 4 is picked here to 
keep (4/DAC(t)) in the same range. 
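
The MOD weighting of method 5 can be sketched as follows. This is a minimal illustration rather than the program used in the experiment. It assumes that a term is described by the list of its document frequencies in the clusters where it occurs (so that NCLUS(t) is the length of that list) and that the upper limit of each TMULT step is inclusive; the lower-case function names are hypothetical.

```python
def tmult(nclus, step, lowlim):
    """Step function that decreases as the cluster frequency grows."""
    if nclus <= step:
        return 2.0
    if nclus <= 2 * step:
        return 1.75
    if nclus <= 3 * step:
        return 1.50
    if nclus <= 4 * step:
        return 1.25
    if nclus <= lowlim:
        return 1.00
    return 0.50


def aveclus(clusfreq):
    """Average document frequency of the term over its clusters."""
    return sum(clusfreq) / len(clusfreq)


def dac(clusfreq):
    """Average over the clusters of (absolute deviation from the mean + 1)."""
    avg = aveclus(clusfreq)
    return sum(abs(avg - f) + 1 for f in clusfreq) / len(clusfreq)


def val(clusfreq, step=5, lowlim=50):
    """val(t) = TMULT(t) x DAC(t) for one term."""
    return tmult(len(clusfreq), step, lowlim) * dac(clusfreq)


if __name__ == "__main__":
    print(dac([1, 1, 1]))            # Example 1 of the text
    print(round(dac([3, 1, 1]), 1))  # Example 2 of the text
    print(round(val([3, 1, 1]), 2))  # weight multiplier under MOD(5,50)
```

The two calls to `dac` correspond to the worked examples above; a term that is concentrated in one of its clusters receives a larger DAC, and hence a larger val(t), than a term spread evenly.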

5. The Results 

The results are summarized in Tables 2 and 3, which are broken 
down into Tables 4 to 8 for clarity. The quantities a, b, c used there 
are the same as those in Tables 2 and 3 and are explained on page 6. 

       control   Dis    IDF    MOD(5,50)  MOD(3,38)  HDVD   MODI(1)  MODI(2)

a       .65      .603   .589    .649       .653      .664    .690     .645
b       .537            .492    .522       .528      .611    .599     .581
c       .315     .237   .252    .274       .281      .375    .385     .364
Q_1     1.21     1.25   1.20    1.24       1.24      1.06    1.15     1.11
Q_2     2.0      2.5    2.3     2.3        2.32      1.77    1.70     1.77

      The Summary of the Results for SKIP Clustered Collection 

                            Table 2 





       control   Dis    IDF    MOD(5,50)  HDVD   MODI(1)  MODI(2)

a       .712     .689   .668    .73       .712    .725     .681
b       .50      .433   .454    .477      .579    .577     .523
c       .273     .192   .209    .229      .336    .312     .290
Q_1     1.42     1.6    1.57    1.53      1.23    1.3      1.1
Q_2     2.6      3.5    3.2     3.1       2.1     2.32     2.35

      The Summary of the Results for RCL Clustered Collection 

                            Table 3 


                        SKIP                     RCL
                  a       b       c        a       b       c

control          .65     .537    .315     .712    .50     .273
dis              .603            .237     .689    .433    .192
control/dis      1.08    1.11    1.33     1.03    1.15    1.42

                Control vs Discrimination Values 
                            Table 4 



1. Discrimination Value Model 

With the application of the discrimination values the average 
correlation between the cluster centroids and the main centroid (quantity b) 
and the average inter-cluster correlation (quantity c) were lower than the 
corresponding ones for the control. This implies that the distances between 
clusters are lengthened with the application of the discrimination values; 
this results in a more spread out space. However, the document and cluster 
centroid correlations (quantity a) are also smaller, that is, within a 
cluster the discrimination values also cause the documents to spread out. 
Nevertheless, the control/discrimination value ratio for quantity a is 
smaller than those for quantities b and c in Table 4, implying that the 
expansion within a cluster is comparatively less than the expansion of the 
space itself. Furthermore, the Q_1 and Q_2 values of Tables 2 and 3 for the 
discrimination value model are larger than the corresponding ones for the 
control. Therefore, with the use of discrimination values the space is more 
spread out and, although the absolute sizes of the clusters are larger, 
relative to the size of the entire space the clusters are smaller. 
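
The comparison made in this paragraph can be reproduced from the tabulated averages. The sketch below takes the RCL figures for the control and discrimination value collections from Table 3 and recomputes the control/dis ratios of Table 4; the dictionary layout and the function names are illustrative choices only.

```python
# Averages a, b, c for the RCL clustered collection, as listed in Table 3.
control = {"a": 0.712, "b": 0.50, "c": 0.273}
dis = {"a": 0.689, "b": 0.433, "c": 0.192}


def ratios(base, other):
    """Ratio base/other for each of the quantities a, b and c."""
    return {k: base[k] / other[k] for k in base}


def q_values(m):
    """Q_1 = a/b and Q_2 = a/c for one collection."""
    return m["a"] / m["b"], m["a"] / m["c"]


if __name__ == "__main__":
    r = ratios(control, dis)
    print({k: round(v, 2) for k, v in r.items()})
    print([round(q, 2) for q in q_values(control)])
    print([round(q, 2) for q in q_values(dis)])
```

The ratio for quantity a is the smallest of the three, which is the numerical form of the statement that the clusters expand less than the space as a whole.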

2. Inverse Document Frequency 

With the application of inverse document frequency all three 
quantities a, b and c are smaller, again indicating a more spread out 
document space. Except for the Q_1 value for SKIP's IDF collection, the 
Q values for the IDF are larger than the control. The exception may be 
due to the non-relevant overlapping of SKIP's clustered collection. Aside 
from this exception, the results suggest the same conclusion as the 
results for the discrimination value collection. 
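
The inverse document frequency reweighting itself can be sketched in a few lines. The exact form of the weighting function is an assumption here: the sketch uses log2(n/DF) + 1, where n is the number of documents, and both the logarithm base and the additive constant should be read as illustrative. The names `idf_factor` and `reweight` are hypothetical.

```python
import math


def idf_factor(df, n_docs):
    """Multiplier that grows as the document frequency df shrinks.

    Assumed form: log2(n_docs / df) + 1, so that a term occurring in
    every document keeps its original weight.
    """
    return math.log2(n_docs / df) + 1


def reweight(doc, df, n_docs):
    """Multiply each concept weight of one document by its IDF factor."""
    return {t: w * idf_factor(df[t], n_docs) for t, w in doc.items()}


if __name__ == "__main__":
    n_docs = 424                        # size of the Cranfield collection
    df = {"flow": 424, "ablation": 53}  # invented document frequencies
    doc = {"flow": 2.0, "ablation": 1.0}
    print(reweight(doc, df, n_docs))
```

Under this assumed form the rare term is multiplied by a larger factor than the common one, which is the emphasis on low frequency terms described in Section 4.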





                        SKIP                     RCL
                  a       b       c        a       b       c

control          .650    .537    .315     .712    .50     .273
IDF              .589    .492    .252     .668    .454    .209
control/IDF      1.10    1.09    1.21     1.07    1.10    1.31

              Control vs Inverse Document Frequency 
                            Table 5 

3. The HDVD Collection 

A surprising result in this case is that for the relevant clusters 
quantity a, the average correlation between documents and their cluster 
centroid, is the same as that in the control collection. Although the actual 
correlations do vary for each cluster, on the average the clusters do 
not expand nor contract. In spite of the fact that with the SKIP clustered 
collection quantity a for the HDVD method is the largest, it is only a 
moderate increase over the control collection. It seems to denote that 
the discriminating terms affect the inter-cluster distances more than the 
clusters themselves. 

With the HDVD method all three quantities are larger or stay 
unchanged. The control/HDVD ratio is largest for quantity a, showing that 
the least changes occur in the individual clusters. Both Q values are 
decreased. All the changes were opposite from what was observed in the 
previous two cases. With the entire document space contracted and the 
individual clusters unchanged or less contracted, the HDVD method results 
in forming larger clusters with respect to the document space than the 
control collection. 





                        SKIP                     RCL
                  a       b       c        a       b       c

control          .650    .537    .315     .712    .50     .273
HDVD             .654    .611    .375     .712    .579    .336
control/HDVD     .91     .88     .84      1.0     .85     .81

                        Control and HDVD 
                            Table 6 








                 SKIP MOD(5,50)       SKIP MOD(3,38)       RCL MOD(5,50)
                 a     b     c        a     b     c        a     b     c

control         .650  .537  .315     .650  .537  .315     .712  .50   .273
MOD             .649  .522  .274     .653  .528  .281     .73   .477  .229
control/MOD     1.0   1.03  1.5      1.0   1.02  1.12     .98   1.05  1.20

                     Control and the MOD Methods 
                               Table 7 

4. The MOD Methods 

This method was created to decrease the distances between documents 
within a cluster and to increase the distances between clusters. Even 
though the control/MOD ratio of quantity a for all three cases is less than 
or equal to one, only in one case (relevant cluster MOD(5,50)) is the 
size of the clusters significantly decreased. Of the other two cases, one 
is decreased slightly, and the other is increased slightly. However, 
quantities b and c were decreased so that the inter-cluster distances of 
all three cases increased significantly enough to produce larger Q_1 and 
Q_2 values. 

                        SKIP                     RCL
                  a       b       c        a       b       c

control          .650    .537    .315     .712    .50     .273
MODI(1)          .690    .599    .385     .725    .557    .312
MODI(2)          .645    .581    .364     .681    .523    .290
control/MODI(1)  .94     .89     .82      .98     .90     .88
control/MODI(2)  1.0     .92     .87      1.1     .96     .94

                       Control vs MODI's 
                            Table 8 

5. The MODI Methods 

From the control/MODI ratios we can see that the inter-cluster 
distances (quantities b and c) are shortened. The results for MODI(2) 
are more satisfying since the cluster sizes increased as we hoped. 
Although the cluster sizes for MODI(1) decreased slightly, the MODI 
methods do give smaller Q_1 and Q_2 values as we expected. 

6. Conclusions 

1. The fact that the distance between two centroids increases 
implies that on the average the distance between any document in one cluster 
and any document in another cluster increases also. With the 
application of either the discrimination values or the inverse document 
frequency, the distance between documents is increased. Moreover, the MOD 
collections were constructed such that the inter-cluster distances are 
larger. All these indexing methods are found to have better search 
performance than the control method, as shown in Table 9. On the other 
hand, the HDVD model and the MODI models show that a more contracted 
document space produces deterioration in search performance. For the 
discrimination model, the result that the distances between documents 
increase is obvious. However, it is not obvious for the inverse document 
frequency model, where the correlations between documents are not taken 
into consideration during construction. It can be concluded that a 
spread out document space is beneficial for the retrieval performance. 

2. Indexing methods that have been found to produce improvements 
in search performance as measured by recall and precision do have relatively 
more compact clusters with respect to the entire collection space. This 
conclusion is justified by the observation that, in all cases (the 
discrimination value, inverse document frequency and MOD), the increase 
in distance between centroids is more than the expansion of the cluster 
itself; whereas for the indexing methods that give less good search 
performance the relative sizes of the clusters are larger. However, the 
Q_1 and Q_2 values cannot be depended upon to evaluate the search 
performance of a collection in place of recall and precision. The 
Q_1 and Q_2 values of the discrimination value model are larger than those 
of the inverse document frequency model, but from the search results in 
Table 9, the discrimination value model does not necessarily perform better 
than the inverse document frequency model. Similarly, the HDVD method 
performs worse than MODI(2), but the Q_1 value of MODI(2) for the RCL 
clustered collection is smaller than that of the HDVD method. Furthermore, 
the difference in search performance between HDVD and control is more than 
that between the control and MOD method, but the differences in Q values 
are nearly the same. Nevertheless, the Q values do give a fairly 
compatible ranking of the performance of the various methods compared to 
the searches performed as in Table 9. 

3. With the application of the HDVD method, the average correlation 
of documents to their centroids over all clusters does not vary for the 
relevant clustered collection. This is an indication that the deletion of 
high discrimination value terms, nearly one seventh of the concepts in the 
collection, has little effect within clusters. On the other hand, the sizes 
of the clusters vary with the use of the inverse skewness factor, as shown 
in MODI(1) and MODI(2) in Table 8. To a lesser extent the discrimination 
value model and the inverse document frequency model are also emphasizing 
the different effects of the terms. It becomes apparent that there are 
various functions played by the terms under different conditions, similar 
to the idea that there are terms that promote recall and others that promote 
precision. It is now obvious that terms are needed to distinguish documents 
that belong to different clusters; whereas within a cluster, the terms 
needed should have less discriminating power or should only discriminate 
documents within the cluster. 

A Theory of Term Importance in Automatic 
Text Analysis 
G. Salton*, C.S. Yang*, and C.T. Yu+ 

Abstract 

Most existing automatic content analysis and indexing techniques 
are based on word frequency characteristics applied largely in an ad hoc 
manner. Contradictory requirements arise in this connection, in that terms 
exhibiting high occurrence frequencies in individual documents are often 
useful for high recall performance (to retrieve many relevant items), whereas 
terms with low frequency in the whole collection are useful for high precision 
(to reject nonrelevant items). 

A new technique, known as discrimination value analysis, ranks the text 
words in accordance with how well they are able to discriminate the documents 
of a collection from each other; that is, the value of a term depends on how 
much the average separation between individual documents changes when the given 
term is assigned for content identification. The best words are those which 
achieve the greatest separation. 

The discrimination value analysis accounts for a number of important 
phenomena in the content analysis of natural language texts: 

a) the role and importance of single words; 

b) the role of juxtaposed words (phrases); 

c) the role of word groups or classes, as specified in a thesaurus. 

Effective criteria can be given for assigning each term to one of these three 
classes, and for constructing optimal indexing vocabularies. 

The theory is validated by citing experimental results. 

* Department of Computer Science, Cornell University, Ithaca, N.Y. 14853 
+ Department of Computer Science, University of Alberta, Edmonton, Alberta 

1. Document Space Configuration 

Consider a collection of entities D (documents) represented by 
weighted properties w. In particular, let 

$$D_i = (w_{i1}, w_{i2}, \ldots, w_{it})$$

where $w_{ij}$ represents the weight of term j in the vector corresponding 
to the ith document. Given two documents $D_i$ and $D_j$, it is possible to 
define a measure of relatedness $s(D_i, D_j)$ between the documents depending 
on the similarity of their respective term vectors. In three dimensions (when 
only three terms identify the documents), the situation may be represented by 
the configuration of Fig. 1, where the similarity between any two of the 
document vectors may be assumed to be a function inversely related to the 
angle between them. That is, when two document vectors are exactly the same, 
the corresponding vectors are superimposed and the angle between them is zero. 

When the dimensionality of the space exceeds three, that is when more 
than three terms are used to identify a given document, the envelope of the 
vector space may be used to represent the collection as in the example of 
Fig. 2. Here only the tips of the document vectors are shown, represented by 
x's, and the distance between two x's is inversely related to the similarity 
between the corresponding document vectors: the smaller the distance between 
x's, the smaller will be the angle between the vectors, and thus the more 
similar the term assignments. 



[Fig. 1: Vector Representation of Document Space] 

[Fig. 2: Multidimensional Document Space (O = centroid of space, 
X = individual document)] 

A central document, or centroid C, may be introduced, located in the 
center of the document space, which for certain purposes may represent the 
whole collection. The ith vector element $c_i$ of the centroid can simply 
be defined as the average of the ith term $w_{ji}$ across the n documents of 
the collection; that is 

$$c_i = \frac{1}{n} \sum_{j=1}^{n} w_{ji}.$$

It is clear that a particular document space configuration, such as 
that of Fig. 2, reflects directly the details of the indexing chosen for the 
identification of the documents. This raises the question about the choice 
of an optimum indexing process, or alternatively, about an effective document 
space configuration. A number of studies, carried out over the last few years, 
indicate that a good document space is one which maximizes the average separation 
between pairs of documents. [1,2] In particular, the document space will be 
maximally separated, when the average distance between each document and the 
space centroid is maximized, that is, when 

n 

Q = E s(C, D.) (2) 

i=l 

is minimum. Obviously, in such a case, it may be easy to retrieve each given
document without also necessarily retrieving its neighbors. This ensures a
high precision output, since the retrieval of a given relevant item will then
not also entail the retrieval of many nonrelevant items in its vicinity.
Furthermore, when the relevant documents are located in the same general area
of the space, high recall may also be obtainable, since many relevant items
may then be correctly retrieved, and many nonrelevant correctly rejected.*

A particular indexing system, known as the discrimination value
model, assigns the highest weight, or value, to those terms which cause the
maximum possible separation between the documents of a collection. This
model is described and analyzed in the remainder of this study.
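
The centroid and the density measure Q of equation (2) can be sketched in a
few lines of Python. This is only an illustration: the study does not fix the
similarity function s at this point, so the cosine of the angle between the
vectors is assumed here, and the function names and the two toy collections
are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def centroid(docs):
    """Average each term weight across the n document vectors."""
    n = len(docs)
    return [sum(column) / n for column in zip(*docs)]

def space_density(docs):
    """Q: sum of the similarities between the centroid and each document."""
    c = centroid(docs)
    return sum(cosine(c, d) for d in docs)

# Toy collections (assumed data): identical documents versus disjoint ones.
tight = [[1, 1, 0], [1, 1, 0], [1, 1, 0]]
spread = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(space_density(tight), space_density(spread))
```

Under the cosine assumption the collection of identical documents has the
larger Q, that is, the denser and less separated space.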

2. The Discrimination Value Model

The discrimination value of a term is a measure of the changes in
space separation which occur when a given term is assigned to a collection
of documents. A good discriminator is one which, when assigned as an index
term, will render the documents less similar to each other; that is, its
assignment decreases the space density. Conversely, a poor discriminator
increases the density of the space. By computing the space densities both
before and after assignment of each term, it is possible to rank the terms
in decreasing order of their discrimination values.

In particular, consider a measure of the space density, such as the
Q value given in equation (2), and let $Q_k$ represent the density Q with
the kth term removed from all document (and from the centroid) vectors. The
discrimination value of term k may then be defined as

$$DV_k = Q_k - Q. \qquad (3)$$



* Retrieval performance is often measured by parameters such as recall and
precision, reflecting the ratio of relevant items actually retrieved, and
of retrieved items actually relevant.

Obviously, if term k is a good discriminator, then its removal will cause
a compression in the document space (an increase in space density), because
its assignment would have resulted in an increase in space separation. Thus
for good discriminators $Q_k > Q$ and $DV_k$ is positive. The reverse is true
for poor discriminators whose removal causes a decrease in space density,
leading to negative discrimination values. A vast majority of the terms may
be expected to produce neither increase nor decrease in space density; in
such a case a discrimination value near zero is obtained. The operations of
a good discriminator are illustrated in the simplified drawing of Fig. 3.
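
As a hedged illustration of equation (3), the following Python sketch computes
a discrimination value for every term by deleting the term from all document
vectors and recomputing the density. The cosine similarity, the helper names,
and the three-document example are assumptions made for the sketch and are not
taken from the study.

```python
import math

def _cosine(u, v):
    """Cosine similarity; a zero vector is treated as having similarity 0."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def _density(docs):
    """Q of equation (2): summed similarity between centroid and documents."""
    n = len(docs)
    c = [sum(col) / n for col in zip(*docs)]
    return sum(_cosine(c, d) for d in docs)

def discrimination_values(docs):
    """DV_k = Q_k - Q, where Q_k is the density with term k removed."""
    q = _density(docs)
    values = []
    for k in range(len(docs[0])):
        # Removing column k also removes it from the recomputed centroid.
        reduced = [d[:k] + d[k + 1:] for d in docs]
        values.append(_density(reduced) - q)
    return values

# Assumed toy data: term 0 occurs everywhere, terms 1 and 2 in one document each.
docs = [[1, 1, 0], [1, 0, 1], [1, 0, 0]]
dv = discrimination_values(docs)
print(dv)
```

In this toy collection the term assigned to every document receives a negative
value, while the two terms that single out individual documents receive
positive values, in line with the sign convention described above.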

In the retrieval experiments conducted earlier with three collections
in aerodynamics (Cranfield collection, 424 documents comprising 2651 distinct
terms), medicine (Medlars collection, 450 documents comprising 4726 terms),
and world affairs (Time collection, 425 documents comprising 14098 terms), the
discrimination value model produced excellent retrieval results. [1] In
particular, a term weighting system which assigns to each term k a value
$w_{kj}$, consisting of the product of its frequency of occurrence in document
j ($f_{kj}$) multiplied by its discrimination value $DV_k$,

$$w_{kj} = f_{kj} \cdot DV_k, \qquad (4)$$

produces recall and precision improvements of about ten percent over
methods where only the term frequencies $f_{kj}$ are taken into account.*



* Terms receiving high weights according to expression (4) are those which
exhibit high occurrence frequencies in certain specified documents, and
at the same time can distinguish these documents from the remainder of
the collection.



[Fig. 3: Operation of Good Discriminating Term. Panels: before assignment
of term; after assignment of term. Legend: X = document, O = main centroid]

It may be of interest to inquire what kind of terms are favored
by a weighting system such as that of expression (4), and what accounts for
the value of the discrimination model. Some experimental evidence relating
the discrimination values to certain frequency characteristics of the terms
in the document collections is presented in the next section. This, in turn,
leads to an indexing theory to be examined in the remainder of this study.

3. Discrimination Values and Document Frequencies 

Consider any term k assigned to a collection of documents, and let
$d_k$ be its document frequency, defined as the number of documents in the
collection to which term k is assigned. More specifically,

$$d_k = \sum_{j=1}^{n} b_{kj},$$

where $b_{kj} = 1$ whenever $f_{kj} \geq 1$, and $b_{kj} = 0$ otherwise. It is
instructive to arrange the terms assigned to a document collection into
disjoint sets in such a way that the terms assigned to a given set have equal
document frequencies $d_k$. Moreover, for each such set of terms the average
rank in decreasing discrimination value order may be computed, thereby
relating document frequencies with discrimination values.*

A plot giving the average discrimination value rank for the terms
exhibiting certain document frequency ranges is shown in Figs. 4(a), (b), and
(c) for the collections in aerodynamics, medicine, and world affairs
(Cranfield, Medlars, and Time) respectively. It may be seen that a U-shaped
curve is obtained in each case, with the following interpretation:



* For a set of t terms, the discrimination value rank ranges from 1 for the
best discriminator to t for the worst.

a) the terms with very low document frequencies, located on the
left-hand side of Fig. 4, are poor discriminators, with average
discrimination value ranks in excess of t/2 for t terms;

b) the terms with high document frequencies exceeding n/10, located
on the right-hand side of Fig. 4, are the worst discriminators,
with average discrimination value ranks near t;

c) the best discriminators are those whose document frequency is
neither too high nor too low, with document frequencies
between n/100 and n/10 for n documents; their average
discrimination value ranks are generally below t/5.

The output of Fig. 4 shows average discrimination value ranks only.
Before deciding that all terms with low and high document frequencies can
automatically be disregarded, it is useful to determine whether any good
discriminators are in fact included in the corresponding low frequency and high
frequency term sets. Figs. 5(a) and 5(b) show sets of low frequency terms for
the Medlars and Time collections respectively, together with the number of
good discriminators (those with discrimination ranks between 1 and 100)
included in each set. Fig. 5 shows overlapping term sets, consisting of
all terms with document frequency equal to 1, 1 and 2, 1 to 3, etc., together
with the percentage figures of the total number of terms represented by the
corresponding sets.

Thus when seventy percent of the terms are taken in increasing document
frequency order (corresponding in the Medlars collection to about 3200 terms
out of 4700 with document frequencies of 1 or 2, and in the Time collection to
9900 terms out of 14000 with document frequencies 1 to 3), it is seen that
only about 15 good discriminators are included for Medlars, and about 12 for
Time. When the proportion of terms increases to eighty percent in increasing
document frequency order, including 3800 Medlars terms, or 11300 Time terms,
with document frequencies from 1 upward, the number of good discriminators
rises to 30 for Medlars and 35 for Time. When so few good terms are included
among the mass of low frequency terms, it is obvious that special provisions
must be made in any indexing process for the utilization of these terms.

Consider now the very high-frequency terms — those which according 
to the output of Fig. 4 exhibit the lowest discrimination values. While the 
number of such terms is not large, each of the terms accounts for a substantial 
portion of the total term assignments to the documents of a collection because 
of the high document frequency involved. 

The output of Fig. 6(a) for Medlars, and 6(b) for Time shows that about
four percent of the terms present in a document collection, namely those of
highest frequency, account for forty to fifty percent of all term assignments,
when the terms are taken in decreasing document frequency order. The absolute
number of distinct terms is approximately 200 for the Medlars collection and
about 500 for Time. In each case, fewer than 15 of these terms are classified
as good discriminators. When the proportion of terms taken in high frequency
order increases to six percent, accounting for 46 percent of the term
assignments in Medlars, and 57 percent for Time, the number of good
discriminators increases to about 20 in each case.

The information included in Figs. 5 and 6 is summarized in Table 1.
In each case, certain cutoff percentages are given for terms taken either in
low document frequency or in high document frequency order. For each such
percentage, the number of good discriminators included in the corresponding
term set is stated for each of the three test collections. Thus when sixty
percent of the terms are taken in increasing document frequency order, not
a single good discriminator is included among the 1668 terms for the
Cranfield collection; only 5 of the top 50 terms, or 16 of the top 100,
are present among the 3238 Medlars terms; finally, for Time, 1 out of
the top 50, or 11 of the top 100, are included among the first 8916 low
frequency terms.

The number of good discriminators included among the high frequency
terms for the three collections is similarly low, as shown in the bottom
half of Table 1.

The conclusion to be reached from the data of Figs. 5 and 6 and
of Table 1 is that very few good discriminators are included among the
bottom seventy percent, or among the top four percent, when the terms
included in a collection of documents are taken in increasing document
frequency order. This fact is used to construct an indexing strategy
in the remainder of this study.

4. A Strategy for Automatic Indexing

Consider the graph of Fig. 7 in which the terms are once again
arranged in increasing document frequency order. If the assumption is
correct that the best terms for indexing purposes are concentrated in the
set whose document frequency is neither too high nor too low (the
frequency being approximately between n/100 and n/10), then the following
term transformations should be undertaken:

a) Terms whose document frequency lies between n/100 and n/10
should be used for indexing purposes directly without any
transformation; these terms include the vast majority of the
good discriminators.

b) Terms whose document frequency is too high (above n/10)
comprise the worst discriminators. These terms are too
general in nature, or too broad, to permit proper discrimination
among the documents; hence their use produces an unacceptable
precision loss (it leads to the retrieval of too many items
that are extraneous). These terms should be transformed into
lower frequency terms (right-to-left on the graph of Fig. 7),
thereby enhancing the precision performance.

c) Terms whose document frequency is too low (below n/100)
are so rare and specific that they cannot retrieve an acceptable
proportion of the documents relevant to a given query; hence
their use depresses the recall performance. These terms should
be transformed into higher frequency terms (left-to-right
on the graph of Fig. 7), thereby enhancing the recall performance.
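
The three cases above amount to a simple decision rule on the document
frequency of a term. A minimal Python sketch follows; the function name and
the returned labels are hypothetical, and the thresholds n/100 and n/10 are
the approximate bounds quoted in the text rather than exact cutoffs.

```python
def indexing_action(document_frequency, n_documents):
    """Suggest a transformation for a term from its document frequency."""
    if document_frequency > n_documents / 10:
        # Too broad: combine into phrases (right-to-left, helps precision).
        return "form phrase"
    if document_frequency < n_documents / 100:
        # Too specific: group into a thesaurus class (left-to-right, helps recall).
        return "assign to thesaurus class"
    # Medium frequency: use the term directly as an index term.
    return "use directly"

# Assumed example: a collection of 1000 documents.
for df in (5, 50, 200):
    print(df, indexing_action(df, 1000))
```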

It remains to describe the right-to-left and left-to-right transformations
that may be used to generate useful indexing vocabularies. The obvious way of
transforming the high frequency terms into lower frequency entities is to
combine them into indexing phrases. In general, a phrase such as "programming
language" exhibits a lower assignment frequency than either of the high
frequency components "language" or "program". The summary of Fig. 7 then
indicates that:

Indexing phrases should be constructed from high
frequency single term components in order to enhance
the precision performance of the retrieval system.

The other left-to-right transformation which is required for recall
enhancing purposes is now equally obvious. Low frequency terms with somewhat
similar properties, or meanings, can be combined into term classes, normally
specified by a thesaurus of related terms, or synonym dictionary. When a
single term is replaced for indexing purposes by a thesaurus class consisting
of several terms, the assignment frequency of the thesaurus class will in
general exceed that of any of the components included in the class. Thus:

The main virtue of a thesaurus is its ability to group
a number of low frequency terms into thesaurus classes,
thereby enhancing the recall performance.

A large number of different strategies is available for the generation
of indexing phrases and of thesauruses. Consider first the criteria used
for the formation of phrases. A phrase might be created whenever two or
more components cooccur in the same document, or query; or when they cooccur
in the same paragraph, or sentence of a document; or when they occur in
certain specified positions within the same sentences; or, finally, when
they cooccur in certain specified positions in a text while exhibiting
certain predetermined syntactical relationships. The methods needed to
identify the indexing phrases attached to a given document or query may then
range from quite simple (any pair of noncommon terms cooccurring in a
document may represent a phrase) to quite complex (the various phrase
components must exhibit appropriate syntactical relationships and these
relationships must be ascertained). [3]

For present purposes, a compromise position is adopted which bypasses
an expensive syntactic analysis system in favor of the following procedure:

a) phrases are defined by using query texts;

b) common function words are removed and a suffix deletion method is
used to reduce the remaining query words to word stems;

c) the remaining word stems are taken in pairs, and each pair defines
a phrase provided that the distance in the text between the two
phrase components does not exceed two (at most one intervening
word occurs between components), and provided that at least one
of the components of each phrase is a high-frequency term;

d) phrases for which both components are identical are eliminated;

e) duplicate phrases, where all components match an already existing
phrase, are eliminated.

The texts of all documents are checked for the presence of any phrase thus 
defined from the query statements, and appropriate weights are assigned. 
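
Steps a) to e) can be sketched in Python as shown below. Several details are
assumptions made only for the sketch: the short stop list, the crude suffix
list standing in for the suffix deletion method, the measurement of distance
in the reduced (stop-word-free) query text, and the treatment of a phrase as
an unordered pair of stems. The set of high-frequency stems is supplied by
the caller.

```python
import re

# Illustrative stop list and suffix list (assumptions, not the SMART lists).
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to"}
SUFFIXES = ("ing", "es", "ed", "s")

def stem(word):
    """Very rough suffix deletion: strip the first matching suffix."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def form_phrases(query, high_frequency_stems, max_distance=2):
    """Return the distinct stem pairs of a query that qualify as phrases."""
    words = re.findall(r"[a-z]+", query.lower())
    stems = [stem(w) for w in words if w not in STOP_WORDS]   # step b)
    phrases, seen = [], set()
    for i, p in enumerate(stems):
        # step c): partner lies at most max_distance positions to the right
        for j in range(i + 1, min(i + max_distance + 1, len(stems))):
            q = stems[j]
            if p == q:                                        # step d)
                continue
            if p not in high_frequency_stems and q not in high_frequency_stems:
                continue
            key = tuple(sorted((p, q)))
            if key not in seen:                               # step e)
                seen.add(key)
                phrases.append(key)
    return phrases

result = form_phrases("The programming languages of the systems programming",
                      {"programm", "languag"})
print(result)
```

Each resulting pair would then be looked up in the document texts, as the
procedure above goes on to require.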

The phrase formation process is illustrated in Fig. 8 for a query
dealing with world affairs. It is seen that this query gives rise to eight
distinct phrases with adjacent components, plus seven additional phrases
for which the components are separated by one intervening word in the reduced
query text.

It remains to determine an appropriate weight to be assigned to each
phrase created by the foregoing process. Thus if terms p and q exhibit
weights $w_{ip}$ and $w_{iq}$, respectively, in document i, corresponding, for
example, to the frequencies of occurrence of the respective terms in the
document, the phrase consisting of components p and q might be assigned
weight $w_{ipq}$ defined as

$$w_{ipq} = \frac{w_{ip} + w_{iq}}{2}. \qquad (5)$$

A somewhat more refined weighting method uses $w_{ipq}$ in conjunction
with an "inverse document frequency" (IDF) factor which gives higher weights
to phrases that occur comparatively rarely in the collection. The original
inverse document frequency (IDF) factor, introduced by Sparck Jones, was
defined as [4]:

$$IDF_k = \lceil \log_2 n \rceil - \lceil \log_2 d_k \rceil + 1,$$

where $IDF_k$ is the IDF factor for term k, and $d_k$ is the document
frequency of term k in a collection of n documents. Clearly $IDF_k$ is
large when $d_k$ is small, and becomes small as $d_k$ approaches n.
By analogy, a phrase IDF factor may be defined as:

$$IDF_{pq} = \log n - \frac{\log d_p + \log d_q}{2}, \qquad (6)$$

where $d_p$ and $d_q$ are the respective document frequencies of phrase
components p and q.

In conformity with the composite weighting system of equation (4)
which uses the product of term frequencies and discrimination values, a
composite phrase weight $W_{ipq}$ for phrase pq in document i may then be
defined as the product of the IDF factor and the average component weight
(equations (5) and (6)):*

$$W_{ipq} = \left[ \log n - \frac{\log d_p + \log d_q}{2} \right] \cdot \left[ \frac{w_{ip} + w_{iq}}{2} \right]. \qquad (7)$$
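
The term IDF factor and the phrase weights of equations (5) to (7) can be
checked numerically with a short Python sketch. The function names are
hypothetical, and because the text does not state the base of the logarithm
in equation (6), natural logarithms are assumed there.

```python
import math

def idf_term(n, d_k):
    """Sparck Jones factor: ceil(log2 n) - ceil(log2 d_k) + 1."""
    return math.ceil(math.log2(n)) - math.ceil(math.log2(d_k)) + 1

def idf_phrase(n, d_p, d_q):
    """Equation (6): log n minus the mean log document frequency (natural log assumed)."""
    return math.log(n) - (math.log(d_p) + math.log(d_q)) / 2

def phrase_weight(n, d_p, d_q, w_ip, w_iq):
    """Equation (7): phrase IDF factor times the average component weight (5)."""
    return idf_phrase(n, d_p, d_q) * (w_ip + w_iq) / 2

# Assumed example values: 1000 documents, components occurring in 200 and 50.
print(idf_term(1000, 3))
print(phrase_weight(1000, 200, 50, w_ip=2, w_iq=4))
```

A phrase whose components are both rare in the collection and frequent in a
given document receives the largest weight under this scheme.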

In a retrieval environment, the phrases defined by the foregoing
procedure may be used to replace the original phrase components; that is,
the original components may be removed from the document and query vectors
before the phrase identifiers are added. Alternatively, phrase components
may be used in addition to the single term components. For the
experiments described in the next section, the former policy was used in
that phrases are introduced replacing the original component terms.

* As before, the weighting system of expression (7) assigns high weights
to phrases with highly weighted components in individual documents but
with relatively low overall document frequency in the collection.

Consider now the converse to the right-to-left phrase formation
process, namely the left-to-right thesaurus construction method. Here
the notion is to use low frequency terms and to assemble them into classes
of terms replacing the original vector components. If $d_p$ and $d_q$ are
the document frequencies of terms p and q respectively, the document
frequency of the class which includes both p and q may be defined as

$$D_{pq} = d_p + d_q - d_{pq},$$

where $d_p$, $d_q$, and $d_{pq}$ are the numbers of documents containing term p,
term q, and both p and q, respectively. In general $D_{pq}$ may be expected
to be larger than either $d_p$ or $d_q$ individually. When m terms are
included in a given term class, the document frequency of the class is
defined simply as the number of documents in which at least one term assigned
to that class appears.
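
The class document frequency is the size of a set union, which the following
Python sketch makes explicit. The posting-list representation (a mapping from
each term to the set of documents containing it) and the sample terms are
assumptions chosen for illustration.

```python
def class_document_frequency(postings, class_terms):
    """Number of documents containing at least one term of the class."""
    documents = set()
    for term in class_terms:
        documents |= postings.get(term, set())
    return len(documents)

# Assumed toy postings: term -> identifiers of the documents containing it.
postings = {"p": {1, 2, 3}, "q": {3, 4}, "r": {7}}
print(class_document_frequency(postings, ["p", "q"]))
```

For two terms the union size agrees with the inclusion-exclusion form given
above: 3 + 2 - 1 documents in the example.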

Term classes are often defined by a thesaurus, and a given thesaurus
class normally includes terms that are sufficiently similar in meaning,
or context, to make it reasonable to ignore their differences for indexing
purposes. A great many thesaurus construction procedures have been described
in the literature including manual term grouping as well as fully automatic
methods. [5,6,7,8] Among the latter are the so-called associative indexing
procedures, where statistically associated terms are jointly assigned to the
documents of a collection, and a variety of term clustering methods designed
to group into a common class those terms which exhibit similar term assignments
to the documents of a collection.

For experimental purposes it may be sufficient to use existing manually
constructed thesauruses for the three test collections, restricting the
thesaurus to include only classes whose document frequency does not exceed
a stated maximum. Such a thesaurus then effectively limits the number of
high-frequency terms that can appear in any class, and provides the
left-to-right frequency transformation specified by the model of Fig. 7. The
weight with which a thesaurus class is assigned to a document or query vector
may be defined as the average weight of the component terms originally
present in that vector.


A frequency-restricted thesaurus such as the one described above may 
not specify classes that are completely identical with the term classes 
obtainable by initially using only the low frequency terms for a separate term 
clustering process; however the experimental recall-precision results may be 
expected to be close to those produced by an original thesaurus construction 
method. 

The recall-precision results obtained from the operations modelled in
Fig. 7 are examined in the next section.

5. Experimental Results 

The right-to-left phrase formation process is designed to produce
lower frequency entities from high frequency components, and vice versa for
the left-to-right thesaurus grouping process. The data of Table 2 show that
the required frequency alterations are in fact obtained by the two transformations
for the test collections in use.

Table 2(a) shows that the document frequency of the phrases is only
about one third as large as the frequency of the individual components
entering the phrase formation process. In Table 2(b) the reverse is seen
to be the case for the thesaurus concepts whose document frequency is one
and a half times that of the individual thesaurus entries. If the model
of Fig. 7 specifying ideal frequency characteristics for index terms is
appropriate, considerably better recall and precision output should be
obtainable with the transformed terms (phrases and thesaurus classes) than
with the originals.

[Table 2(b)]

Detailed recall-precision output is contained in Tables 3 and 4
and in the summary in Table 5 for the various indexing methods applied to
the three test collections in aerodynamics, medicine, and world affairs.
Performance figures comparing the standard term frequency weighting ($f_{ik}$)
for single terms k in documents i with the phrase process are shown
in Table 3. The phrase procedure uses the normal single terms in addition
to indexing phrases weighted in accordance with the formula of expression (7).

Table 3 contains precision figures averaged over 24 user queries
for each of the test collections at ten specified recall levels ranging in
magnitude from 0.1 to 1.0 in steps of 0.1. The percentage improvement in
precision for the phrase process over the standard is also given at each
recall level, together with an average improvement ranging from a high of
39 percent for the Medlars collection to a low of 17 percent for Time.

Table 4 contains output similar to that already shown in Table 3.
However the data in Table 4 apply to an indexing system using both
left-to-right (thesaurus) and right-to-left (phrase) transformations. It is seen
from Table 4 that the thesaurus transformation adds an additional average
improvement of 13 percent in precision for the Medlars collection; additional
advantages are also obtained for the Cranfield and Time collections.

The evaluation results are summarized in Table 5. It is seen that
average precision values of approximately 0.70, 0.40, and 0.20 at high,
medium, and low precision are transformed into average figures of 0.90,
0.60 and 0.30 approximately when the discrimination properties of the terms
are optimized. The retrieval results displayed in Tables 3, 4, and 5 have not
been surpassed by any manual or automatic indexing procedures previously
tried with sample document collections and user queries. Furthermore,
because of the high average precision values produced by the indexing
theories described in this study, it is not likely that additional drastic
improvements in retrieval effectiveness are obtainable in the foreseeable
future.

Negative Dictionary Construction
R. Crawford
1. Introduction 

Effective information retrieval is based on the ability to
provide an accurate description of each item and to be able to discriminate
between the available information items. In the area of document
retrieval, a set of words (terms) chosen from the subject area of the 
documents may be used to describe the documents. [1,2,3] If the set of 
words used to describe each document is chosen properly, then each 
document will have a description which is both accurate and unique in 
relation to the other documents. The document descriptions should 
reflect the same differences and similarities between documents as would 
be noticed by a reader of the original documents. 

Thus, for a collection of documents in a particular subject area,
two problems are apparent. First, a set of terms must be chosen for use
in describing the documents in the collection. This set of chosen terms
is called a dictionary. The process of selecting the set of terms is
called dictionary construction. Second, specific terms from the
dictionary must be selected for use in describing each document. This
assignment of terms to describe documents is called context analysis or
document indexing. Both dictionary construction and document indexing
have been previously investigated, with both manual and automatic methods
considered. A large degree of success has been found in using fully
automatic procedures for document indexing. [3,4,5] Automatic dictionary
construction has not proved so successful and it is this area which is
being presently investigated.

Dictionary construction may be conveniently considered in terms 
of several specific areas. 

(i) NEGATIVE DICTIONARY CONSTRUCTION.

The determination of which terms to exclude from the
document indexing process. This is defined more
explicitly in the next section.

(ii) WORD STEMMING.

Entries in a dictionary may be grouped according to stems
by means of suffixing. This involves construction of
both Word Form and Word Stem dictionaries.

(iii) THESAURUS CONSTRUCTION.

Terms having similar properties may be clustered to form
a single dictionary entry. These may be hierarchical in
structure and may be based on many different similarity
properties.

(iv) PHRASE DICTIONARIES.

Words or concepts used frequently in combination are
identified.

(v) DICTIONARY UPDATING. 

The dictionary for a dynamic document collection must also 
be dynamic. This involves updating of the dictionary as 
documents are added to or deleted from the collection, or 
as word usage changes. 

The remainder of this paper deals with the first of these areas:
negative dictionary construction. Further background of this specific
area is given in the following section.

2. Negative Dictionaries

2.1 Common Words

The words used in the text of a document may be divided into
two classes, which might be described as "words important to the
meaning of the document" and "words important only to the structure
of the document". For example, consider the phrase:

"The role of the generality effect in retrieval system
evaluation is assessed, ..."

as used in a document in the field of information retrieval. Intuitively,
the words in this phrase could be divided into the following two classes:

"MEANING" WORDS      "STRUCTURE" WORDS

ROLE                 THE
GENERALITY           OF
EFFECT               IN
RETRIEVAL            IS
SYSTEM
EVALUATION
ASSESSED

Those words which are important only to the structure of the 
documents are called function words or common words. Those words which 
are important to the meaning of the sentence are called content words. If 
only function words are classified as common, then the process of negative 
dictionary construction is not difficult. However, a closer examination 
of the previous example reveals some further problems, indicating that the 
class of common words should possibly be expanded to include words other
than function words. 



Consider the use of the word "retrieval" in the above example.
Since this was taken from a document in the field of information retrieval,
some question may be raised as to the validity of using the word "retrieval"
to describe any document. It might be expected that many, or even all
of the documents in this collection would contain this word. Although
using "retrieval" to index each document in which it occurs may contribute
to the accuracy of the description of those documents, it will also serve
to make distinguishing among those documents more difficult. For this
reason, words which have a very high frequency of occurrence in a collection
may be considered to be common words.

Again examining the example given, consider the use of the word
"role". Although not strictly a function word, "role" would not appear
crucial to the meaning of the phrase to the same extent as "generality" or
"evaluation". In fact, the phrase could easily be reworded in several ways
to eliminate completely the word "role". Thus in this collection, retrieval
may be expedited by treating a word such as "role" as common because of its
usage. In a collection dealing with the theatre, for example, "role" might
in fact be a very important and meaningful word.

Words which are classed as common because of their high frequency
of occurrence in a particular collection, or because of their specific usage
in the collection are called collection-specific common words. Thus, it
is convenient to classify words in the following manner:

WORDS IN A DOCUMENT COLLECTION
    CONTENT WORDS
    COMMON WORDS

A classification such as this may be useful in constructing a
negative dictionary using manual methods. Manual construction of
negative dictionaries is discussed further in Section 4. This model is
not, however, as useful when automatic negative dictionary construction
methods are considered. Thus, new approaches to the problem of
classifying words in a collection have been considered, with the hope
that classifications which may be stated in more precise mathematical
terms may also be simpler to implement using automatic techniques. This
notion is expanded in Section 5.

2.2 The Negative Dictionary 

A negative dictionary is defined as a list of words whose use is
proscribed for content analysis purposes. Based on the classification
outlined in the previous section, the negative dictionary for a collection
is composed of those terms in the collection which are either function
words or collection-specific common words.

The importance of accurate construction of negative dictionaries
has been demonstrated. Words which do not contribute to the effectiveness
of the information retrieval process must be excluded. Bergmark [6] has
shown that, in at least one case, the advantage of a thesaurus over a
word stem dictionary was due to more accurate determination of common
words, rather than to the clustering of the terms.

Negative dictionary construction is considered in detail in the
following sections. Section 3 describes briefly the experimental procedures
used, including the retrieval system, document collections, and evaluation
methods. In Section 4 manual negative dictionary construction is described
and some retrieval results presented. Section 5 includes discussion of
possible techniques to be used in automatic construction of negative
dictionaries. In Section 6, the techniques of Section 5 are incorporated into
specific algorithms for use in negative dictionary construction. Several
algorithms are tested for retrieval effectiveness and these results are
presented. Finally the work is evaluated and conclusions are drawn in
Section 7.

3. Experimental Procedures 

The retrieval system, document collections, and evaluation parameters 
used in the experiments to follow are described briefly. 

3.1 The SMART System 

The SMART system is an automatic document retrieval system "designed 
for the exploration, testing, and measurement of proposed algorithms for 
document retrieval". [7] All the experiments discussed in the following 
sections were performed using the SMART system as the experimental base. 

As described by Williamson, [7] the SMART system takes documents 
and search requests in natural language, performs a fully-automatic content 
analysis of the texts, matches analyzed documents with analyzed search 
requests, and retrieves those stored items believed to be most similar to 
the queries. A description of the implementation of the SMART system may 
be found in [7]. Reports of previous work done using the SMART system 
are numerous, including [8] and [9]. 

3.2 The Experimental Data Base 

Two document collections are chosen for use in the retrieval 
experiments. The first is the Medlars collection of 1033 abstracts from 
the field of medicine, along with 35 queries for which relevancy judgements 
were obtained. The second is the ophthalmology collection consisting of 
852 documents and 35 queries. Again, a set of relevancy decisions for each 
of the queries with each document was obtained. These collections are 
suitably large to yield valid results, yet are of a size which allows extensive 
experiments to be performed using the computer resources available. 

3.3 Evaluation Parameters 

Two principal measures have been chosen for use in evaluating the 
retrieval effectiveness of the methods being tested. [10] These measures 
are precision (P) and recall (R), which are defined as follows: 

\[ P = \frac{\text{number of relevant documents retrieved}}{\text{number of documents retrieved}} \]

\[ R = \frac{\text{number of relevant documents retrieved}}{\text{number of relevant documents in the collection}} \]

Thus, precision is the percentage of retrieved documents which are actually 
relevant, whereas recall is the percentage of relevant documents actually 
retrieved. In presenting retrieval results, these recall and precision values 
are averaged over all search requests and displayed in the form of a graph. 
The performance of different methods is compared using these precision-recall 
graphs. 
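The two definitions above can be illustrated with a minimal Python sketch. The function names and the representation of documents as integer identifiers are assumptions made for illustration only. The simple per-query averaging shown here is also an assumption; the report itself averages at fixed recall levels in order to draw its graphs.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: document ids returned by the search (hypothetical ids)
    relevant:  document ids judged relevant to the query
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # relevant documents retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision_recall(results):
    """Average precision and recall over (retrieved, relevant) pairs, one per query."""
    pairs = [precision_recall(ret, rel) for ret, rel in results]
    n = len(pairs)
    return (sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n)
```

For example, a query that retrieves documents 1, 2, 3 and 4 when documents 2, 4 and 7 are relevant has a precision of 2/4 and a recall of 2/3.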

4. Manual Negative Dictionary Construction 

4.1 Methods For Manual Negative Dictionary Construction 

Dictionaries and keyword lists used for content analysis purposes 
always include a negative dictionary. Thus, the dictionary construction 
process involves partitioning the terms in a collection into two sets of 
terms: those terms which are to be included in the indexing of the 
documents (the inclusion list), and those terms to be excluded from the 
indexing (the exclusion list). Generally, manual construction of a 
dictionary proceeds from either of two directions. In one case, the 
keyword list or inclusion list is selected from all the words in the 
collection. The remaining terms thus form the exclusion list or negative 
dictionary. In the other case, the emphasis is placed on determining which 
terms to exclude from the indexing process and it is the negative dictionary 
which is constructed first. Thus the remaining terms in this case form 
the inclusion list. Although differing in description, these two approaches 
to dictionary construction are not too diverse. In each case, all distinct 
words in a collection are manually examined and a decision is made with 
regard to their usefulness in indexing the documents. 

There are interesting examples of experimental work involving 
dictionary construction using each of the above approaches. In work performed 
by Vaswani and Cameron, [11] a dictionary was constructed from a sample of 
1,648 abstracts. Initially, a list was constructed showing each word occurring 
in the sample, along with the number of times the word occurred. The method 
of then constructing the dictionary proceeded as follows: 

"The list was studied very carefully by three people, two 
of them being fairly familiar with the subject matter, who 
decided intuitively which words to retain in the system as 
keywords, all others being excluded from further consider- 
ation". 

Thus, the negative dictionary consisted of those terms "excluded from further 
consideration". 

In document retrieval experiments done using the SMART system, the 
following procedure has been used for constructing negative dictionaries: [12] 

(i) A standard common word list is prepared consisting of 
function words to be excluded from the dictionary; 

(ii) A concordance listing is generated for a sample of the 
document collection under consideration, giving the 
context and the total frequency of occurrence for each 
word; 

(iii) The common word list is extended by adding new non- 
significant words taken from the concordance listing; 
many of the words added to form the negative dictionary 
are either very high frequency words providing little 
discrimination in the subject area under consideration, 
or very low frequency words which produce few matches 
between queries and documents. 
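Step (ii) of this procedure is the part most easily mechanized. The following Python sketch builds a simple concordance listing, giving the total frequency and the contexts of each word. The function name, the tokenization rule, and the context width are all assumptions made for illustration; they do not describe the SMART concordance program itself.

```python
import re
from collections import defaultdict


def concordance(documents, width=3):
    """Map each word to (total frequency, list of contexts).

    documents: list of text strings
    width:     number of words of context kept on each side (assumed value)
    """
    entries = defaultdict(list)
    for doc in documents:
        words = re.findall(r"[a-z]+", doc.lower())
        for pos, word in enumerate(words):
            left = words[max(0, pos - width):pos]
            right = words[pos + 1:pos + 1 + width]
            # The keyword is shown in upper case inside its context.
            entries[word].append(" ".join(left + [word.upper()] + right))
    return {w: (len(ctx), ctx) for w, ctx in entries.items()}
```

The decision in step (iii), whether a listed word is non-significant, remains a manual one in the procedure described above.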

The use of automatically generated aids such as concordance 
listings and word frequency counts has proved helpful during manual 
dictionary construction. Nevertheless, the construction process still 
involves an intellectual decision with regard to each term in the collection, 
and many of these decisions must still be made somewhat intuitively. 

4.2 Manual Negative Dictionary Construction - Performance Results 
Using the manual negative dictionary construction method outlined 
in the previous section, a negative dictionary was constructed for the 
Medlars collection. The remaining (non-excluded) terms were then processed 
three separate ways to produce the following three dictionaries: 

(i) The M1 word form (suffix 's') dictionary, formed by stripping 
the final s from all terms; 

(ii) The M2 word stem dictionary, formed by automatic removal of 
suffixes as determined from a previously prepared standard 
suffix list; 

(iii) The M3 thesaurus, or synonym dictionary, formed by manually 
grouping dictionary entries into synonym categories, or 
concept classes. 

A recall-precision graph showing the performance of these three dictionary 
types is given in Fig. 1. 

The performance of the M3 thesaurus is clearly better than that of 
either the M1 or M2 dictionaries, and may be attributed to a combination of 
accurate common word recognition and careful term clustering. The 
performance of the M1 word form and the M2 word stem dictionaries is quite 
similar; however, the M1 dictionary gives better performance results at 
all recall points except in the range of .10 to .30. 

Based on these results, consideration is given as to which dictionary 
type to use for testing of automatic negative dictionary construction methods. 
The simple word form dictionary type is selected for several reasons. First 
of all, use of a thesaurus presents both construction and analysis problems; 
it may be difficult to determine whether performance changes are due to 
common word recognition or to term clustering. Secondly, the performance 
of the word stem dictionary in the manual case gives no reason to select it 
over the word form dictionary. Finally, the word form dictionary is the 
simplest to construct. Therefore, the M1 word form dictionary is selected 
as the "control" dictionary for use in comparing the effectiveness of 
manual and automatic negative dictionary construction methods. 

5. Automatic Methods of Negative Dictionary Construction 

In approaching the problem of automatic negative dictionary 
construction, a first course of action may be to adhere closely to one of the 
algorithms used in manual negative dictionary construction, attempting to 
automate each step of the manual process. Examples of the difficulties 
found when this approach is used may be seen in each of the manual methods 
outlined in the previous section. In the case of the method used by 
Vaswani and Cameron, the problem arises at the point at which the manual 
worker "decides intuitively" which terms to include in the dictionary. 
It is difficult to conceive an automatic procedure which will duplicate 
the intuitive decisions made by an individual. In the case of the manual 
negative dictionary construction procedure used on the SMART system, the 
problem of automation arises in the step involving examination of the 
concordance listing. Although a manual worker may with high consistency 
locate collection-specific common words by examining a concordance listing, 
this process does not yield directly to automatic methods. Based on these 
considerations, it is worthwhile to develop another approach to the problem 
of automating the negative dictionary construction process. 

The approach that is followed is to consider factors regarding terms 
in a collection which may be measured and evaluated objectively. Several 
factors are considered and tested. These are divided into the following 
three areas: 

(i) frequency and distribution, 
(ii) discrimination value, 
(iii) distribution correlation. 

In the following three sections, each of these areas is discussed in detail. 

5.1 Frequency and Distribution of Terms 

Three basic statistics are considered for use in determining 
which terms belong in the negative dictionary for a collection. These values, 
which may be easily computed for each term in a collection, are: 
total frequency, which is the total number of occurrences of the 
term in the collection; document frequency, which is the number of 
documents in the collection in which the term occurs at least once; 
and average usage, which gives the average number of times the term 
is used within the documents in which it actually occurs (this is 
simply total frequency divided by document frequency). For 
discussion purposes, three separate areas are considered, in which 
these statistics are utilized. Terms of low frequency are discussed 
in Section 5.1.1, terms of high frequency are discussed in Section 
5.1.2, and a discussion of the average usage of terms is given in 
Section 5.1.3. 
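The three statistics just defined may be computed in a single pass over a collection. The Python sketch below assumes that each document has already been reduced to a list of terms; the function name and the data layout are illustrative assumptions.

```python
from collections import Counter


def term_statistics(documents):
    """Compute (total frequency, document frequency, average usage) per term.

    documents: list of documents, each a list of terms (assumed pre-tokenized)
    """
    total = Counter()     # occurrences of the term in the whole collection
    doc_freq = Counter()  # number of documents containing the term
    for tokens in documents:
        counts = Counter(tokens)
        total.update(counts)
        doc_freq.update(counts.keys())
    # Average usage is total frequency divided by document frequency.
    return {t: (total[t], doc_freq[t], total[t] / doc_freq[t]) for t in total}
```

A term occurring three times in all, spread over two documents, therefore has a total frequency of 3, a document frequency of 2, and an average usage of 1.5.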

5.1.1 Low Frequency 

A consideration of terms with a low frequency of occurrence 
is important due to the fact that the majority of the terms in a 
collection occur only a very few times. For example, in the Medlars 
collection of 1033 medical abstracts, there are 14,534 unique terms 
in the text. Table 1 lists the number and percentage of terms with 
specific low total frequencies in this collection. It can be seen from 
this table that words of total frequency one account for almost half the 
unique occurrences in the collection. 

  TOTAL        NUMBER       PERCENTAGE      TOTAL 
FREQUENCY     OF TERMS       OF TERMS     OCCURRENCES 

    1           7,065           48%           7,065 
    2           2,073           15%           4,146 
    3             937            6%           2,811 
  over 3        4,459           31%         146,526 

  TOTAL        14,534          100%         160,548 

        Number of Low Frequency Terms 
            (Medlars Collection) 
                  Table 1 

Consideration is given to placing this large number of single 
occurrence terms on the exclusion list, and two advantages of doing so are 
noted. First of all, many of these terms are actually errors in the text, 
such as misspellings, improper hyphenation, etc. Excluding these terms 
causes elimination (but not correction) of these errors during the document 
indexing process. Second of all, the size of the inclusion list is kept 
much smaller by excluding single frequency terms. Although this is an 
efficiency consideration, which may be considered as of lesser importance 
than retrieval effectiveness, maintaining a dictionary of a size which may 
be handled is an important factor. 

Regardless of these two advantages, it is the effect of terms of 
single occurrence on retrieval effectiveness which must be considered. 

Very low frequency words may be expected to produce few matches 
between queries and documents. [13] It may be argued that the few matches 
which do occur will be important, and that low frequency words should 
therefore be retained. Excluding very low frequency terms may therefore 
cause a decrease in the level of precision of the retrieval results for 
some queries. 

For the two document collections used in this study, all terms 
occurring once in a collection were matched against the terms in the queries 
for that collection. In no case was there a match. All terms having a total 
frequency of one could therefore be excluded without any resulting loss in 
precision of retrieval results. Because it is difficult to generalize these 
results to either the case of more queries in these present collections, or 
to the case of larger collections, it is clear that some compromise must be 
made regarding low frequency terms. This compromise is between retrieval 
effectiveness and retrieval efficiency. When terms of very low total 
frequency are excluded, the dictionary is kept small, increasing efficiency, 
but the level of precision may drop, indicating a decrease in retrieval 
effectiveness. The choice of a particular value of total frequency for use 
in excluding low frequency terms depends on the levels of retrieval effectiveness 
and efficiency which are required. 

The effect of deleting terms of total frequency one from the Medlars 
collection is investigated experimentally. All 6,718 terms of frequency 
one are deleted from the M1 word form dictionary to form the MA word form 
(no frequency one) dictionary. The comparative performance of the M1 and 
MA dictionaries is shown by means of a precision-recall graph in Fig. 2. 
As discussed previously, none of the deleted terms matched with any of the 
query terms, so the performance of the M1 and MA dictionaries should be 
similar. This is verified by the results shown in Fig. 2. Those slight 
changes which do occur are a result of the decrease in the lengths of the 
document vectors due to deletion of the low frequency terms. 
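The check described above, in which single occurrence terms are matched against the query terms before being deleted, can be sketched as follows. The function names and the representation of documents and queries as lists of terms are assumptions made for illustration.

```python
from collections import Counter


def single_occurrence_terms(documents):
    """Return the set of terms whose total frequency in the collection is one."""
    total = Counter(t for tokens in documents for t in tokens)
    return {t for t, n in total.items() if n == 1}


def query_matches(terms, queries):
    """Return the subset of `terms` that appears in at least one query."""
    query_terms = {t for q in queries for t in q}
    return terms & query_terms


def delete_terms(documents, excluded):
    """Remove every excluded term from every document."""
    return [[t for t in tokens if t not in excluded] for tokens in documents]
```

If query_matches returns an empty set for the single occurrence terms, deleting them cannot remove any query-document match, which corresponds to the situation reported for the two collections used here.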

5.1.2 High Frequency 

Very high frequency words provide little or no discrimination in 
the subject area of the document collection under consideration. It may, 
therefore, be worthwhile for very high frequency words to be placed on the 
exclusion list. Excluding very high frequency terms from the indexing 
process may result in some decrease in the level of recall for certain 
queries. [13] That is, there may be some documents relevant to a particular 
query which are not retrieved by that query due to the exclusion of one or 
more high frequency terms which would have provided a match between the 
query and the documents. However, if high frequency terms are not excluded 
from the indexing process, then a query may match significantly with 
documents which are quite dissimilar. This may result in a low level of 
precision for certain queries. 

Obviously, some compromise is necessary between the desire for 
high recall and the need for high precision. It may be possible, however, 
to delete very high frequency terms so that precision increases, without 
affecting recall to a very great extent. What is desired is a function, 
based on either total frequency, or document frequency, or both, which 
will enable determination of those high frequency terms which belong on 
the negative exclusion list. 

Consider the list of words in Table 2. These are the terms of 
highest frequency in the Medlars collection of 1033 medical abstracts. 
The words are given in decreasing order of document frequency. By 
examining the total frequency given for each word, it is apparent that 
an ordering by total frequency would have been quite different. In 
particular, words such as CELL and CASE would occur much higher on this 
list if it were ordered by total frequency. 

The three lines drawn through Table 2 indicate levels of document 
frequency of 20, 25, and 30 percent of total collection size. All the 
terms occurring with a document frequency of 30% of collection size or 
greater are clearly function words and belong in the negative dictionary 
for this collection. On the other hand, above a document frequency level 
of 20% of collection size there are several terms which are clearly not 
function words. At the document frequency level of 25% of collection 
size, only the word PATIENT is not a function word, and it is clear that 
the word PATIENT could easily be a collection-specific common word in a 
medical collection. 

Similar results to the above have been found for a collection of 
852 abstracts in the field of ophthalmology, in that a document frequency 
level of 25% of collection size provided a point above which terms could 
be placed on the exclusion list with a high degree of confidence. 

[Table 2: terms of highest frequency in the Medlars collection, in decreasing 
order of document frequency, with the total frequency of each term. The table 
body is not recoverable.] 

Thus, terms with a document frequency greater than some chosen 
cutoff value should be placed in the negative dictionary. By choosing 
this cutoff value properly, precision may be improved without any 
significant reduction in recall. 
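A document frequency cutoff of this kind is simple to apply automatically. In the Python sketch below, the cutoff is expressed as a fraction of collection size; the function name and the default value of 0.25 (taken from the 25% level discussed above) are illustrative assumptions, and the proper value for a given collection must still be chosen by experiment.

```python
from collections import Counter


def high_frequency_terms(documents, cutoff=0.25):
    """Return terms whose document frequency exceeds cutoff * collection size.

    documents: list of documents, each a list of terms
    cutoff:    fraction of collection size (assumed default of 0.25)
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    return {t for t, df in doc_freq.items() if df > cutoff * n_docs}
```

Raising the cutoff shortens the exclusion list, which favors recall; lowering it lengthens the list, which favors precision.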

5.1.3 Average Usage 

It is worthwhile to investigate whether terms which belong in the 
negative dictionary may be distinguished by their average usage from terms 
which do not belong in the negative dictionary. In particular, it would 
seem reasonable that terms which are of importance in a collection may be 
used several times within the documents in which they occur, thus having 
a high average usage. On the other hand, function words might be expected 
to have a more random distribution, thus having a low or medium average 
usage. Average usage values were examined for over 6000 terms from a 
collection of 852 documents in the area of ophthalmology. 

Table 3 lists some of the terms from this collection, ordered by 
average usage. An examination of this list shows that content words, such 
as LASER and CYST, cannot be distinguished from common words, such as WAS 
and TO by means of average usage values. Average usage is therefore 
rejected for use in automatic negative dictionary construction. 

5.2 Discrimination Value 

A document collection may be defined as a distribution of terms 
taken from a specified set of terms. Thus, each term may be considered 
as a possible index term on the basis of its distribution in relation to 
the distribution of all other terms in the collection. Developing this 
idea, a term is considered to be a discriminator if its distribution is 
such that it serves in distinguishing or discriminating among the documents 
in the collection. A term which does not serve in distinguishing among 
the documents in the collection is a non-discriminator. For example, any 
term which occurs in all the documents in a collection is a non-discriminator 
in that collection, as it may not be used to distinguish among the documents 
in any way. 

It is useful then, to define some function for terms in a collection 
which would indicate whether they are discriminators or non-discriminators. 
Such a function, the discrimination value, is suggested, based on the document 
space similarity described by Aste-Tonsmann and Bonwit. [14] Some 
notation and definitions are given and the discrimination value is derived 
in the following section. 

5.2.1 The Discrimination Value Function 

A collection of N documents is represented by a set of document 
vectors $d_j$, $1 \le j \le N$. Each vector $d_j$ is of length m, where m is the 
number of terms used in indexing the documents in the collection. Then $d_{ij}$ 
is the number of occurrences of the i-th term in the j-th document. Therefore, 
a value of $d_{ij} = 0$ indicates that the i-th term did not occur in document j. 
The centroid, c, of a document collection is defined by $c = (c_1, c_2, \ldots, c_m)$, 
where: 

\[ c_i = \frac{1}{N} \sum_{j=1}^{N} d_{ij} . \]

The centroid represents a term by term average of the documents and is 
considered to be the center of the set of documents (i.e. of the document 
space). 

A measure of the correlation of two vectors $d_j$ and $d_k$ is given by the 
cosine function: 

\[ \cos(d_j, d_k) = \frac{(d_j, d_k)}{\|d_j\| \, \|d_k\|} \]

where $(d_j, d_k)$ is the inner product and $\|d_j\|^2 = (d_j, d_j)$. For purposes of 
calculation, this is conveniently expressed as: 

\[ \cos(d_j, d_k) = \frac{\sum d_{ij} d_{ik}}{\left( \sum d_{ij}^2 \right)^{1/2} \left( \sum d_{ik}^2 \right)^{1/2}} \]

where the sums are for i = 1 to m, the number of terms in the vectors. 

Now the compactness or document space similarity, Q, is defined as: 

\[ Q = \frac{1}{N} \sum_{j=1}^{N} \cos(c, d_j), \qquad 0 \le Q \le 1. \quad \text{(i)} \]

The value of Q is thus a function of the homogeneity of the documents in 
the collection and the set of terms used in indexing the documents. Given 
a standard index language, a collection of documents from diverse subject 
areas will tend to have a low Q value, whereas a collection of documents 
on very similar topics will tend to have a higher Q value. However, for 
more homogeneous collections, the proper selection of index terms will 
reduce the compactness of the document vectors, enabling better discrimination 
and resulting in improved retrieval. 
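The centroid, the cosine correlation, and the compactness Q can be computed directly from their definitions. The following Python sketch assumes that each document is given as a list of m term counts; the function names are illustrative, and the convention that a zero vector has cosine 0 is an added assumption which the definitions above do not address.

```python
import math


def centroid(docs):
    """Term by term average of the document vectors."""
    n = len(docs)
    return [sum(d[i] for d in docs) / n for i in range(len(docs[0]))]


def cosine(v, w):
    """Cosine correlation of two vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(v, w))
    norm = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w))
    return dot / norm if norm else 0.0


def compactness(docs):
    """Q: the average cosine between the centroid and each document vector."""
    c = centroid(docs)
    return sum(cosine(c, d) for d in docs) / len(docs)
```

A collection of identical documents gives Q = 1, while two documents sharing no terms give a markedly lower value, in line with the remarks above on homogeneous and diverse collections.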



(i) This is the "NORMALIZED Q" of Aste-Tonsmann and Bonwit. It is a 
more convenient measure than their Q and has a well defined 
range. 

(ii) This may be maximized for a Rocchio centroid 

\[ c^* = \frac{1}{N} \sum_{j=1}^{N} \frac{d_j}{\|d_j\|} \]

and not for c as defined. However, this is 
negligible for docs of approximately the same 
length. 

Since Q is a function of the document vectors, deleting a single 
term from all of the document vectors will normally change the value of Q. 
Essentially, deletion of this term represents a new index language, differing 
from the initial one by only one term. The compactness of the collection 
with term i deleted (i.e. $d_{ij} = 0$, $0 < j \le N$) is given by $Q_i$ and is 
defined as: 

\[ Q_i = \frac{1}{N} \sum_{j=1}^{N} \cos(c^i, d_j^i) \]

where $d_j^i$ is the j-th document vector with term i deleted, and $c^i$ is the 
centroid vector with term i deleted. 

Then $(Q_i - Q)$ is a measure of the change in document space compactness 
due to the deletion of term i. If $Q_i > Q$, the document space is 
more compact with term i deleted and term i is a discriminator in the 
collection. If $Q_i < Q$, the document space is less compact with term i 
deleted and term i is a non-discriminator in the collection. Since a 
particular $Q_i$ value is only meaningful in comparison to the value of Q, 
a new measure is defined. The discrimination value of term i, $D_i$, is 
defined as: 

\[ D_i = \frac{Q_i - Q}{Q} \]

$D_i$ has the following properties: 

(a) $D_i < 0$, term i is a non-discriminator 

(b) $D_i > 0$, term i is a discriminator 

(c) $D_j > D_i$, term j is a better discriminator than term i. 

(d) $D_i$ is not an explicit function of collection size, 
allowing comparison of values of $D_i$ computed for a term i occurring 
in several collections. 

The discrimination value thus provides a function by which all terms 
in a collection may be ranked, from greatest non-discriminator to best 
discriminator. 
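The discrimination value can be computed by brute force, recomputing the compactness once for each deleted term. The Python sketch below follows the definitions literally; the function names and the list-of-counts representation are assumptions, and no attempt is made at the efficiency that a collection of several thousand terms would require. Deleting a term is implemented by dropping its coordinate, which gives the same cosines as setting it to zero.

```python
import math


def compactness(docs):
    """Q: the average cosine between the centroid and each document vector."""
    n = len(docs)
    c = [sum(d[i] for d in docs) / n for i in range(len(docs[0]))]
    c_norm = math.sqrt(sum(x * x for x in c))
    total = 0.0
    for d in docs:
        d_norm = math.sqrt(sum(x * x for x in d))
        if c_norm and d_norm:  # a zero vector contributes a cosine of 0
            total += sum(a * b for a, b in zip(c, d)) / (c_norm * d_norm)
    return total / n


def discrimination_values(docs):
    """Return D_i = (Q_i - Q) / Q for every term i (assumes Q > 0)."""
    q = compactness(docs)
    values = []
    for i in range(len(docs[0])):
        reduced = [d[:i] + d[i + 1:] for d in docs]  # term i deleted
        values.append((compactness(reduced) - q) / q)
    return values
```

In a toy collection of three documents, each containing one common term and one term of its own, the common term receives a negative value and each of the other terms a positive value, so that sorting by discrimination value places the common term last.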

It is suggested that for each document collection, a discrimination 
cutoff value exists such that all terms with a discrimination value below 
this cutoff value should be placed in the negative dictionary for the 
collection. Only terms with a discrimination value greater than the chosen 
cutoff value are used in indexing the documents. It is further suggested 
that the discrimination cutoff value for any collection will be strictly 
non-negative. That is, non-discriminators should always be placed in the 
negative dictionary for a collection. For a discrimination cutoff value, 
$D_c$, a term i in a collection may be classified as follows: 

$D_i < 0$: term i is a non-discriminator 

$0 \le D_i \le D_c$: term i is a poor discriminator 

$D_c < D_i$: term i is a good discriminator. 

The effective use of discrimination value in an algorithm for 
negative dictionary construction is demonstrated in Section 6. The 
determination of a discrimination cutoff value and its usefulness are also shown. 
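The three-way classification may be written as a short Python sketch. The function names are hypothetical, and the treatment of the boundary cases (a value of exactly zero, or exactly equal to the cutoff) is an assumption: terms are indexed only when their value is strictly greater than the cutoff, as stated above.

```python
def classify_term(d_value, cutoff):
    """Classify a term by its discrimination value and a non-negative cutoff."""
    if cutoff < 0:
        raise ValueError("the discrimination cutoff value must be non-negative")
    if d_value < 0:
        return "non-discriminator"
    if d_value <= cutoff:  # boundary handling is an assumption
        return "poor discriminator"
    return "good discriminator"


def negative_dictionary(d_values, cutoff):
    """Return the terms to exclude: all those not classified as good discriminators.

    d_values: mapping from term to discrimination value
    """
    return {t for t, d in d_values.items()
            if classify_term(d, cutoff) != "good discriminator"}
```

With a cutoff of 0.05, for instance, a term with value 0.01 is excluded as a poor discriminator even though its value is positive.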

5.2.2 The Set of Non-Discriminators 

Discussion in the previous section concerned the computation of 
discrimination value for specific terms in a collection. However, a 
collection of documents includes a large number of terms, many of which bear 
relation to one another. Thus, conclusions which may be drawn for 
individual terms do not necessarily hold true for groups of terms. For a 
given set of terms, each of which is a non-discriminator, it must be 
considered whether the set of terms also acts in a non-discriminating way. 
Stated simply for two terms in a collection, the question is as 
follows. When the effect of deleting term i alone is known, and the 
effect of deleting term j alone is known, what conclusions may be drawn 
regarding the effect of deleting both terms i and j? For example, 
consider two terms i and j such that $D_i < 0$ and $D_j < 0$; is it true 
that $D_{ij} < 0$, where $D_{ij}$ is defined as the discrimination value of the set 
of terms {i, j}? It can be shown that this is in fact true. That is, 
if terms i and j are non-discriminators, then they also act together 
in a non-discriminating way. 



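THEOREM 1   Let K and L be terms in a document collection such that 
$D_K < 0$ and $D_L < 0$. Then $D_{KL} < 0$. 

PROOF   The compactness of a collection with terms K and L 
deleted is defined as: 

\[ Q_{KL} = \frac{1}{N} \sum_{j=1}^{N} \frac{(c, d_j) - c_K d_{Kj} - c_L d_{Lj}}{\left( \|c\|^2 - (c_K^2 + c_L^2) \right)^{1/2} \left( \|d_j\|^2 - (d_{Kj}^2 + d_{Lj}^2) \right)^{1/2}} \tag{5.1} \]

where $d_{Kj}$ is the number of occurrences of term K in document j. 
The assumption may be made that the vectors are large compared to any one 
term. Therefore: 

\[ \frac{c_K^2 + c_L^2}{\|c\|^2} \ll 1 \quad \text{and} \quad \frac{d_{Kj}^2 + d_{Lj}^2}{\|d_j\|^2} \ll 1 \tag{5.2} \]

Thus, expanding the denominators as a binomial series: 

\[ Q_{KL} = \frac{1}{N} \sum_{j=1}^{N} \frac{(c, d_j) - c_K d_{Kj} - c_L d_{Lj}}{\|c\| \, \|d_j\|} \left( 1 + \frac{c_K^2 + c_L^2}{2 \|c\|^2} \right) \left( 1 + \frac{d_{Kj}^2 + d_{Lj}^2}{2 \|d_j\|^2} \right) \]

\[ \phantom{Q_{KL}} = \frac{1}{N} \sum_{j=1}^{N} \frac{(c, d_j) - c_K d_{Kj} - c_L d_{Lj}}{\|c\| \, \|d_j\|} \left( 1 + \frac{c_K^2 + c_L^2}{2 \|c\|^2} + \frac{d_{Kj}^2 + d_{Lj}^2}{2 \|d_j\|^2} \right) \tag{5.3} \]

dropping the last term, $\frac{(c_K^2 + c_L^2)(d_{Kj}^2 + d_{Lj}^2)}{4 \|c\|^2 \|d_j\|^2}$. 

Let: 

\[ \beta_{Kj} = \frac{1}{2} \left( \frac{c_K^2}{\|c\|^2} + \frac{d_{Kj}^2}{\|d_j\|^2} \right) \tag{5.4} \]

with $\beta_{Lj}$ defined in the same way for term L. Then, 

\[ Q_{KL} = \frac{1}{N} \Bigg( \sum_{j=1}^{N} \frac{(c, d_j)}{\|c\| \, \|d_j\|} + \sum_{j=1}^{N} \frac{(c, d_j)}{\|c\| \, \|d_j\|} \beta_{Kj} + \sum_{j=1}^{N} \frac{(c, d_j)}{\|c\| \, \|d_j\|} \beta_{Lj} \]

\[ \qquad - \sum_{j=1}^{N} \frac{c_K d_{Kj}}{\|c\| \, \|d_j\|} - \sum_{j=1}^{N} \frac{c_L d_{Lj}}{\|c\| \, \|d_j\|} - \sum_{j=1}^{N} \frac{c_K d_{Kj}}{\|c\| \, \|d_j\|} \beta_{Kj} - \sum_{j=1}^{N} \frac{c_L d_{Lj}}{\|c\| \, \|d_j\|} \beta_{Lj} \]

\[ \qquad - \sum_{j=1}^{N} \frac{c_K d_{Kj}}{\|c\| \, \|d_j\|} \beta_{Lj} - \sum_{j=1}^{N} \frac{c_L d_{Lj}}{\|c\| \, \|d_j\|} \beta_{Kj} \Bigg) \tag{5.6} \]

\[ \phantom{Q_{KL}} = Q + (Q_K - Q) + (Q_L - Q) - R_{KL} \tag{5.7} \]

where $Q_K$ and $Q_L$ are expanded in the same way as $Q_{KL}$, and where 

\[ R_{KL} = \frac{1}{N} \left( \sum_{j=1}^{N} \frac{c_K d_{Kj}}{\|c\| \, \|d_j\|} \beta_{Lj} + \sum_{j=1}^{N} \frac{c_L d_{Lj}}{\|c\| \, \|d_j\|} \beta_{Kj} \right) \]

Subtracting Q from both sides and dividing by Q yields: 

\[ \frac{Q_{KL} - Q}{Q} = \frac{Q_K - Q}{Q} + \frac{Q_L - Q}{Q} - \frac{R_{KL}}{Q} \]

Therefore: 

\[ D_{KL} = D_K + D_L - \frac{R_{KL}}{Q} \tag{5.8} \]

Since $R_{KL}$ is strictly positive and Q > 0, then $D_{KL} < 0$. Q.E.D. 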

Having shown that two terms which are non-discriminators also act 
together as a non-discriminating set in the collection, it is a simple 
extension to prove a similar result for all non-discriminators in a 
collection. 

COROLLARY 1   Let the set of terms $S = \{s_1, s_2, s_3, \ldots, s_n\}$ be such that 
$D_i \le 0$ for all $i \in S$ and that $D_i < 0$ for at least one $i \in S$. Then 
$D_S$, defined as the discrimination value for the set of terms, is negative: 
$D_S < 0$. 

PROOF   Since all the equations used in Theorem 1 apply for sets of terms 
as well as single terms, the corollary is easily proved by successive 
application of Theorem 1. 

Thus, it may be concluded that deleting the set of all 
non-discriminators in a collection has the effect of making the collection less 
compact. 
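The effect of deleting a set of terms can also be examined numerically. The Python sketch below computes the discrimination value of a set of terms exactly, without the large-vector approximation used in the proof of Theorem 1; it is offered only as an illustration on a toy collection and not as a verification of the theorem. The function names and the data layout are assumptions.

```python
import math


def compactness(docs):
    """Q: the average cosine between the centroid and each document vector."""
    n = len(docs)
    c = [sum(d[i] for d in docs) / n for i in range(len(docs[0]))]
    c_norm = math.sqrt(sum(x * x for x in c))
    total = 0.0
    for d in docs:
        d_norm = math.sqrt(sum(x * x for x in d))
        if c_norm and d_norm:  # a zero vector contributes a cosine of 0
            total += sum(a * b for a, b in zip(c, d)) / (c_norm * d_norm)
    return total / n


def set_discrimination_value(docs, term_indices):
    """Discrimination value of a set of terms: (Q_S - Q) / Q, assuming Q > 0."""
    drop = set(term_indices)
    q = compactness(docs)
    reduced = [[x for i, x in enumerate(d) if i not in drop] for d in docs]
    return (compactness(reduced) - q) / q


# Toy collection: terms 0 and 1 occur in every document, terms 2 to 4 in one each.
docs = [[1, 1, 1, 0, 0], [1, 1, 0, 1, 0], [1, 1, 0, 0, 1]]
d_k = set_discrimination_value(docs, [0])
d_l = set_discrimination_value(docs, [1])
d_kl = set_discrimination_value(docs, [0, 1])
```

In this toy collection the two common terms are each non-discriminators, and the pair taken together also has a negative discrimination value.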

Further properties of the discrimination value function are 
examined in the next section. 

5.2.3 Analysis of the Discrimination Value Function 

It is of interest to consider further the properties of the discrimina- 
tion value function. In particular, it is important to determine whether 
the discrimination value provides any new information regarding terms in a 
collection, or if in fact the same information is obtainable from term 
frequencies and distributions. Thus, the relationships between the frequency, 
the distribution, and the discrimination value of a term are considered. 

Two approaches are used in investigating this area. First, 
theoretical consideration is given to the effect of various frequencies 
and distributions on the discrimination value. Second, experimental 
results are presented, demonstrating the relationships which do exist 
between discrimination value and the frequency and distribution of terms. 

Yu and Wong [15] investigated the compactness function, $Q_i$, upon 
deletion of terms of various frequencies and distributions. Although 
specific conclusions could not be made, some general results were given. 
These are as follows: 

(i) Any term occurring in nearly all of the documents is a 
non-discriminator, irrespective of the number of occurrences 
within each document. (This is intuitive; however, it is not 
trivial to show.) 

(ii) For a collection of N documents, a term occurring in [illegible] of 
the documents with a constant frequency may be classified 
as a non-discriminator if the following inequality is 
satisfied: 

[inequality illegible] 

(i.e. [illegible] is large) 

(iii) A term occurring with a bunched up distribution, in only a 
few documents, is a non-discriminator. 

This third result (iii) may be invalid due to an assumption made 
by Yu and Wong that for a collection of N documents and m distinct terms, 

[condition on m illegible] 

It is doubtful if this condition is met by any existing collections. 

The Medlars collection was used to investigate experimentally the 
relationship between term frequencies and discrimination value. The 
discrimination value is computed for 6200 terms from this collection and 
these terms are then ordered by discrimination value. This ordered list 
is divided into 31 groups of 200 terms each. Thus the first group consists 
of the 200 terms with the highest discrimination values, and the last group 
contains the 200 terms with the lowe,st discrimination values. Averages 
are then computed giving the total frequency, document frequency, and 
average usage of each group of 200 terms. Figs. 3, 4, and 5 show these 
averages plotted in terms of the ordering by discrimination value.

Consider Fig. 3 for example. The left end of the graph shows that
the first (0-200) group of 200 terms (those with highest discrimination
value) had an average document frequency of 18. That is, the best
discriminators each occur in about 18 of the 1033 documents in the collection.
Proceeding then from left to right across the graph, the figures indicate
that as discrimination value decreases, so does document frequency, until
those terms with the very lowest discrimination value are reached. At
this point (5800), a large jump in document frequency occurs, with the
group of 200 terms of lowest discrimination value having an average
document frequency of about 100. The greatest difficulty in correlating
discrimination value with document frequency arises with the next to last
group of 200 terms of low discrimination value (5800-6000). In this case
the average document frequency of the terms is almost 15, a very similar
value to that of the best discriminators. Thus, very poor and very good
discriminators cannot be distinguished by means of document frequency.
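
The grouping procedure used for Figs. 3, 4, and 5 (order the terms by discrimination value, cut the ordered list into groups of 200, and average the frequency statistics within each group) can be sketched as follows. This is only an illustration: the record keys `dv`, `doc_freq`, and `total_freq` are hypothetical names, and average usage is assumed here to mean total frequency divided by document frequency.

```python
from statistics import mean

def group_averages(terms, group_size=200):
    """Average frequency statistics per group of terms ranked by discrimination value.

    `terms` is a list of dicts with the (hypothetical) keys 'dv', 'doc_freq'
    and 'total_freq'. Groups are returned best discriminators first.
    """
    ordered = sorted(terms, key=lambda t: t["dv"], reverse=True)
    rows = []
    for start in range(0, len(ordered), group_size):
        group = ordered[start:start + group_size]
        rows.append({
            "range": (start, start + len(group)),
            "avg_doc_freq": mean(t["doc_freq"] for t in group),
            "avg_total_freq": mean(t["total_freq"] for t in group),
            # Assumption: average usage = total frequency / document frequency.
            "avg_usage": mean(t["total_freq"] / t["doc_freq"] for t in group),
        })
    return rows
```

With 6200 terms and the default group size, the function returns 31 rows, one per plotted point.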

The results for total frequency given in Fig. 4 show a pattern similar
to the one just discussed. Again there are terms at either
end of the discrimination value ordering which have similar total frequencies.

Fig. 5, which shows the relationship between discrimination value
and average usage, gives an indication that this relationship may be fairly
direct. The group of terms with highest discrimination values (0-200)
shows an average usage of over 2.6, that is, when these good discriminators
are used within a document, they are used several times. As the
discrimination value decreases, the average usage of terms also tends to
decrease, although not monotonically. The results shown in Fig. 5 indicate
that the average usage of a term may possibly be used to determine whether
the term is a good discriminator or a non-discriminator. Since those
results are averages, taken for groups of 200 terms, it is necessary to
examine specific terms within the extreme groups (those with highest
and lowest discrimination values). Table 4 presents the average usage
values for 15 good discriminators and 15 non-discriminators, along with
their document and total frequency values. These sample values are
sufficient to show that discrimination value is apparently based on more
than the frequency and distribution of a term.

Both the theoretical and experimental analysis of discrimination 
value, and its relationship to term frequency and distribution lead to 
the same conclusion. The discrimination value of a term does not depend 
only on the frequency and distribution of that term, but it depends also 
on the frequencies and distributions of all other terms in the collection. 

5.3 Relative Distribution Correlation 

A measure is presented for possible use in determining objectively
which terms to place in the negative dictionary. It is hypothesized that
content words and common words may be distinguished by the distribution of
the documents in which they occur. A common word tends to occur in a
somewhat random fashion, and the documents in which it occurs are not
expected to bear any relation to each other in subject matter. A content
word, on the other hand, tends to occur in documents of fairly homogeneous
subject matter. For example, the words JUST and PANCREAS each occur 10
times in the Medlars documents. However, the documents in which they occur
are distributed in the document space in a way exhibited in Fig. 6. The
documents in which PANCREAS occurs are tightly grouped whereas the
documents in which JUST occurs are spread throughout the document space.

A relative distribution correlation is suggested which indicates
whether the documents in which a term occurs bear a strong or weak
relationship. A term centroid, $c_i$, is defined as the centroid vector for
those documents in which term i occurs, that is, the documents $d_j$ for
which the weight $d_{ji}$ of term i is nonzero:

$$c_i = \frac{1}{n} \sum_{d_{ji} \neq 0} d_j$$

where n is the number of such documents. For example, Fig. 6 shows the
term centroids for the terms JUST and PANCREAS respectively. Then, with
term i occurring in n documents, the relative distribution correlation
for term i, $R_i$, is defined as:

$$R_i = \frac{1}{n} \sum_{d_{ji} \neq 0} \cos(c_i, d_j)$$

so that

$$0 \le R_i \le 1 .$$

If the documents containing a term i are similar in content, then each of
these documents correlates quite highly with the term centroid,
resulting in a value of $R_i$ which is close to 1. Conversely, if the documents
in which term i occurs are quite dissimilar, the value of $R_i$ is close to
0. Consider again the example given in Fig. 6. The distribution of the
documents results in a low relative distribution correlation for the word
JUST but a somewhat higher correlation for the word PANCREAS.

The use of relative distribution correlation is investigated 
experimentally using a set of 20 terms. These terms, along with their 
relative distribution correlations are given in Table 5, ranked in 
order of correlation. 

Several of these terms, such as ANTIGEN (rank U) and ANESTHESIA
(rank H), are clearly content words. Other terms, such as WHOM (rank 6)
and THOUGH (rank 16), are apparently common words. Yet this ranking of
these terms by relative distribution correlation fails to distinguish
between common and content words. It is therefore concluded that the
relative distribution correlation does not provide information which is
useful for purposes of negative dictionary construction.
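
For reference, the measure evaluated in this section can be sketched in a few lines. The sketch assumes that documents are represented as sparse term-weight dictionaries and that the term centroid is the mean of the document vectors containing the term; the function names are illustrative and do not come from the SMART system.

```python
import math

def cosine(u, v):
    """Cosine correlation between two sparse vectors (dicts term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def relative_distribution_correlation(term, docs):
    """Mean cosine between the term centroid and each document containing the term."""
    containing = [d for d in docs if d.get(term, 0) != 0]
    n = len(containing)
    if n == 0:
        return 0.0
    # Term centroid: assumed to be the mean vector of the containing documents.
    centroid = {}
    for d in containing:
        for t, w in d.items():
            centroid[t] = centroid.get(t, 0.0) + w / n
    return sum(cosine(centroid, d) for d in containing) / n
```

A term whose documents are identical in content yields a value of 1, while a term scattered over unrelated documents yields a lower value.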

6. Experiments and Results 

An experimental procedure for testing of the negative dictionary
construction methods already discussed is outlined in Fig. 7. Each node
in Fig. 7 represents a dictionary. Each path between nodes of the tree
represents additional term deletions from the parent node dictionary.
Thus a path down any branch of the tree represents a series of successive
term deletions (and therefore additions to the negative exclusion list).
The root node, N1, represents the initial word list containing all distinct
word tokens in the document collection.

Table 6 shows the length of the dictionaries and the total length
of the document vectors for each of the dictionaries tested, using the
Medlars collection. Fig. 7 and Table 6 should be referred to while
reading the following discussion.

Dictionary    Dictionary length    Total length of
                                   document vectors

A5                 5,941               39,142
A6                 5,771               36,960

Dictionary Statistics
(Medlars Collection)

Table 6

6.1 Manual Negative Dictionary Construction

Each of the nodes in the left subtree in Fig. 7 (nodes M1, M2,
M3, and M4) represents a dictionary formed using manual negative
dictionary construction methods. The performance results for these dictionaries
were presented and discussed in Sections 4.2 and 5.1.1. For convenience,
these results are reproduced together in Fig. 8. Automatic negative dictionary
construction methods which provide an equivalent or higher level of
performance than these manually constructed dictionaries are desired.

6.2 Automatic Negative Dictionary Construction 
Each of the nodes in the right subtree of Fig. 7 represents a
dictionary formed using automatic methods. Two automatic operations are
performed on the N1 word list to produce the A1 automatic word form dictionary.
First, all terms of frequency one are deleted from the N1 dictionary. The
deletion of low frequency terms is discussed in Section 5.1.1. It is
important to note that of the 7,245 terms which occur only once in the
Medlars collection, none occur in any of the queries used. Thus, no word
matches between queries and documents are lost due to deletion of these very
low frequency terms. Second, the suffix 'S' is removed from all terms. Use
of the simple word form dictionary allows accurate comparison of the negative
dictionary construction methods tested, without necessitating consideration
of other effects on retrieval, as would be necessary if a word stem
dictionary or thesaurus were used. A comparison of the performance of the
M1 manual word form dictionary and of the A1 automatic word form dictionary
is shown in Fig. 9. The performance of the M1 dictionary is superior to that
of the A1 dictionary, indicating the effect which manual deletion of
common words can have on retrieval performance. However, automatic
methods may be used to improve and refine the negative dictionary,
increasing the performance of the A1 dictionary to a level above the
performance of the M1 dictionary.

As can be seen in Fig. 7, there are four leaf nodes in the subtree
with root A1, indicating four procedures used in refining the negative
dictionary. These four possible algorithms for negative dictionary
construction are considered in the next four sections.
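
The two automatic operations that turn the N1 word list into the A1 dictionary can be sketched as follows. This is a minimal illustration under stated assumptions: documents are given as lists of upper-case word tokens, the function name `build_a1` is hypothetical, and word forms that coincide after removal of the suffix 'S' are assumed to be merged by adding their frequencies.

```python
from collections import Counter

def build_a1(documents):
    """Return a Counter mapping each A1 term to its total frequency."""
    totals = Counter(tok for doc in documents for tok in doc)  # the N1 word list
    a1 = Counter()
    for term, freq in totals.items():
        if freq == 1:
            continue  # first operation: delete all terms of frequency one
        # Second operation: remove the suffix 'S' (single-letter terms are kept as is).
        form = term[:-1] if term.endswith("S") and len(term) > 1 else term
        a1[form] += freq  # assumption: merged word forms pool their frequencies
    return a1
```

The order of the two steps follows the text: the frequency-one test is applied to the original word tokens, before any suffix is removed.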

6.2.1 Algorithm P1

Proposed algorithm P1 for negative dictionary construction is
outlined in Fig. 10. This algorithm involves use of a standard common
word list and the discrimination values of terms. Algorithm P1 closely
parallels the manual negative dictionary construction procedure described
in Section 4.1. In each case the attempt is to first locate the function
words, and then to determine the collection-specific common words. For
this purpose, in both the manual and the P1 algorithms, a standard common
word list is used to determine the function words. However, in determining
collection-specific common words, the manual procedure involves manual
examination of word context in a concordance, whereas the P1 algorithm
involves use of discrimination value.

Algorithm P1 is not fully automatic due to the necessity of manually
constructing the standard common word list. However, this list may be
constructed only once and retained as a data set for repeated use with each
new collection processed. The use of a standard common word list may therefore
be considered as a semi-automatic method. For the Medlars collection,
there are 256 words in the A1 dictionary which occur on the standard
common word list. These words are therefore deleted from the A1
dictionary, producing the A2 dictionary.

Using the document vectors derived from the A2 dictionary, the
discrimination values for the 5,961 terms in the A2 dictionary are
computed. There are 209 terms which are non-discriminators and these are
deleted from the A2 dictionary to produce the A3 dictionary. Fig. 11
shows the performance curves of the A1, A2, and A3 dictionaries. The
improved performance of the dictionaries produced at successive steps
of this algorithm is apparent. The importance of recognizing and
deleting function words is emphasized by the improvement between the A1
and A2 dictionaries. However, the additional improvement of the A3
dictionary shows the importance of determining collection-specific common
words as well, and indicates the usefulness of discrimination value in
doing so.

The effectiveness of this semi-automatic algorithm for negative
dictionary construction is judged by comparison with manual methods.
Fig. 12 shows the performance of the A3 dictionary in comparison to that
of the M1 manual word form dictionary. The M1 dictionary performed
slightly better than the A3 dictionary at the high recall values of
.95 and 1.00. At all other recall points, the A3 dictionary performed
distinctly better than the M1 dictionary. The conclusion may be drawn
that semi-automatic algorithm P1 may be used to produce dictionaries
which perform as effectively as manually constructed dictionaries in
information retrieval.

6.2.2 Algorithm P2

Algorithm P2, which is shown in Fig. 13, involves a single
automatic refinement of the A1 dictionary. Testing of this algorithm
on the Medlars collection shows that it is not a useful algorithm
for automatic negative dictionary construction. The reason for this is
of interest, particularly as it provides insight into the discrimination
value function.

Computation of the discrimination values for each of the 6,226
terms in the A1 dictionary shows that only 7 terms are non-discriminators.
These 7 non-discriminators each occur in over 73% of the documents, and
their total combined frequency of 37,875 accounts for over 25% of the total
occurrence of all terms in the collection. Thus, these are very high
frequency terms and they have a dominating effect on the calculations
necessary to compute discrimination values. In a sense, the collection
is stable with respect to these seven terms. Deletion of any other
single term has only a very minor effect on the compactness of the
collection due to the dominance of these seven high frequency terms. Since
the A1 dictionary contains no terms of very low frequency (i.e. frequency
one), the average frequency of all the terms is much higher than in the
N1 word list. This probably is what causes the poor results produced by
the P2 algorithm. It may be advisable, then, to delete very high frequency
terms to offset the effect of deleting very low frequency terms.

The results here show that when the average frequency of the terms
in a collection is greatly shifted, the discrimination values computed for
the remaining terms may not give a true indication of which terms are
non-discriminators in the original collection. The algorithm discussed
in the next section provides a solution to this difficulty.

6.2.3 Algorithm P3 

Algorithm P3 for negative dictionary construction is shown in
Fig. 14. The A1 dictionary is refined first by deletion of all terms
occurring in 25% or more of the documents, as discussed in Section 5.1.2.
There are 30 such high frequency terms (see Table 2, Section 5.1.2) and
these are deleted from the A1 dictionary to produce the A4 dictionary.
Discrimination values are computed for the 6,196 terms in the A4 dictionary.
Some of the terms which are non-discriminators, along with some of the best
discriminators in the collection are listed in Table 7. Altogether there
are 255 non-discriminators in the A4 dictionary, and these are deleted,
forming the A5 dictionary. It should be clear upon examination of each
step in Fig. 14 that the A5 dictionary is a fully automatically constructed
dictionary. Essentially three values are specified in executing this
algorithm. These are: the low frequency cutoff value, the high frequency
cutoff value, and the discrimination cutoff value. The choice of these
values has already been discussed; however, in the case of discrimination
cutoff value, several values are considered and tested in the next section.
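
The three cutoff values of algorithm P3 can be made concrete with a small sketch. It rests on assumptions that are not spelled out in this section: the discrimination value of a term is taken to be the change in the compactness of the collection (average cosine correlation between each document and the collection centroid) when the term is removed, with non-discriminators having a value at or below the cutoff of zero. The function names and the sparse dictionary representation are illustrative only.

```python
import math

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def compactness(docs):
    """Average cosine between each document and the collection centroid."""
    centroid = {}
    for d in docs:
        for t, w in d.items():
            centroid[t] = centroid.get(t, 0.0) + w / len(docs)
    return sum(cosine(centroid, d) for d in docs) / len(docs)

def discrimination_value(term, docs):
    # Assumed definition: compactness without the term minus compactness with it.
    without = [{t: w for t, w in d.items() if t != term} for d in docs]
    return compactness(without) - compactness(docs)

def algorithm_p3(docs, low_cutoff=1, high_cutoff=0.25, dv_cutoff=0.0):
    """Return the set of terms kept after the three deletion steps."""
    total, doc_freq = {}, {}
    for d in docs:
        for t, w in d.items():
            total[t] = total.get(t, 0) + w
            doc_freq[t] = doc_freq.get(t, 0) + 1
    # Delete terms of frequency one and terms in 25% or more of the documents.
    keep = {t for t in total
            if total[t] > low_cutoff and doc_freq[t] < high_cutoff * len(docs)}
    reduced = [{t: w for t, w in d.items() if t in keep} for d in docs]
    # Delete the non-discriminators of the reduced collection.
    return {t for t in keep if discrimination_value(t, reduced) > dv_cutoff}
```

Recomputing the compactness once per term, as done here, is quadratic in the size of the collection and is meant only to show the logic on small examples.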

Examination of the A5 dictionary shows that there are 170 words in
it which are also on the standard common word list. It is intuitive that
deletion of these function words from the A5 dictionary may increase its
level of performance in retrieval. To test this, the A6 dictionary is
constructed by deleting all terms from the A5 dictionary which occur on the
standard common word list.

The performance curves for the A1, A4, A5, and A6 dictionaries
are given in Fig. 15. Possibly the most surprising result is the
performance of the A4 dictionary compared to the A1 dictionary, as these
dictionaries differed by only 30 high frequency terms. The A4 dictionary
represents the performance which is obtained by simple deletion of very high
and very low frequency terms. The deletion of non-discriminators to form
the A5 dictionary results in another significant increase in performance
level. However, no significant gain results from further refinement of
the A5 to produce the A6 dictionary. Thus, it may be concluded that the
final semi-automatic step of deleting standard common words is not warranted.
The suggested P3 algorithm is therefore concluded with the formation of the
A5 dictionary.

The result of this algorithm P3 (the A5 dictionary) is compared to
the results of the algorithm P1 (the A3 dictionary) and the manual algorithm
(the M1 dictionary) in Fig. 16. The performance of the A5 dictionary is
distinctly better than that of the M1 dictionary, and slightly improved from
the performance of the A3 dictionary. The fully automatic algorithm for
negative dictionary construction, P3, is therefore superior to either the
manual algorithms now in use, or the semi-automatic algorithm, P1.

6.2.4 Algorithm P4

In Section 5.2.1 the idea of a discrimination cutoff value, D, was
suggested, and the classification of terms into three areas,
non-discriminators, poor discriminators, and good discriminators, was discussed.
In the previous algorithms, only non-discriminators were deleted, that is, a
discrimination cutoff value of zero was chosen. Using the P4 algorithm
shown in Fig. 17 the effect of deleting poor discriminators in addition
to non-discriminators is tested. This algorithm is identical to the P3
algorithm discussed in the last section in its construction of the A5
dictionary. In the P4 algorithm, however, groups of terms are deleted
from the A5 dictionary in increasing order of discrimination value, that is,
the discrimination cutoff value is gradually increased.
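
Raising the discrimination cutoff value is equivalent to keeping only the top-ranked terms of the A5 dictionary. A minimal sketch of that step, assuming the discrimination values have already been computed and are held in a dictionary (the name `dv_by_term` is hypothetical):

```python
def reduced_dictionaries(dv_by_term, sizes):
    """For each requested size, keep that many terms of highest discrimination value.

    Terms are thus deleted in increasing order of discrimination value.
    """
    ranked = sorted(dv_by_term, key=dv_by_term.get, reverse=True)  # best first
    return {size: set(ranked[:size]) for size in sizes}
```

Called with the sizes 1500, 1000, 500, and 250, this would yield nested dictionaries of the kind labelled A5 (1500) through A5 (250).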

Table 8 shows the performance values for the dictionaries
resulting from deletion of an increasing number of terms. These
dictionaries are listed in decreasing order of the number of terms remaining in
the dictionary (i.e. the number of terms to be used in indexing the
documents and queries). Thus, the A5 (5940) is the original A5 dictionary
with all 5940 terms retained, while the A5 (250) is the A5 dictionary with
only the 250 best discriminators remaining. The results shown in Table 8
are displayed graphically in Fig. 18. The results for the A1 dictionary
are included in both Table 8 and Fig. 18 for purposes of comparison.

Dictionary

                         .7149     .1337     .6500
A5 (1500)      .7625     .7135     .1343     .6490
A5 (1000)      .7525     .7066     .1192     .6425
A5 (500)       .7337     .6466     .0982     .5883
A5 (250)       .6586     .4353     .0640     .4610

Retrieval Performance Upon
Successive Reduction of Index Terms by
Increasing Discrimination Cutoff Value
(Medlars Collection)

Table 8

(ill) These measures are described in [lb].

Analysis of Fig. 18 shows that in fact the classification of terms
as good discriminators, poor discriminators, and non-discriminators is
reasonable. The points at the right end of Fig. 18 show the large
improvement in performance which occurs upon the deletion of
non-discriminators. However, moving across the graphs from right to left, very
little change is noticed in any of the measures between the points for
5940 and 1000 index terms. Thus, almost 5000 terms are present in the
collection which have little effect on retrieval results. These are the
poor discriminators. From the point at which 1000 terms remain in the
dictionary, further deletion of terms results in a decrease in each of the
retrieval measures. Thus, these last 1000 terms, having the highest
discrimination values, are good discriminators.

These results are clarified further by the display in Fig. 19 of
performance curves for several of the dictionaries. The results shown
in Fig. 19 may be summarized as follows:

(i) Deletion of non-discriminators results in a significant
increase in performance level (the change from A1 to A5
(5940)).

(ii) Deletion of poor discriminators has the effect of decreasing
performance level very slightly (the change from A5 (5940)
to A5 (1000)).

(iii) Deletion of good discriminators causes a sharp decrease
in performance level (the changes from A5 (1000) to A5 (500)
to A5 (250)).

Whether or not poor discriminators are deleted from the dictionary depends
on the constraints of the particular retrieval environment under
consideration. If recall is to be maximized, then the poor discriminators must not
be deleted. However, if a small decrease in recall may be tolerated,
then poor discriminators may be deleted, resulting in a significant decrease
in dictionary size.

6.2.5 Conclusions and Verification

On the basis of the results presented, it is concluded that fully
automatic methods may be used for constructing negative dictionaries
which provide for a higher level of retrieval performance than do standard
manual methods. In particular, the P4 algorithm is an efficient and
effective algorithm for use in negative dictionary construction. In
addition, it has been found that a dictionary of index terms may be greatly
reduced in size by deletion of poor discriminators without affecting
retrieval performance significantly.

Since these conclusions are based on tests performed on a single 
collection, it is worthwhile to verify the results using another document 
collection. The Ophthalmology collection is used, with results being 
compared for three manually constructed dictionaries and one automatically 
constructed dictionary. The manually constructed dictionaries are a word 
form dictionary, a word stem dictionary, and a thesaurus. The automatically 
constructed dictionary is the A5 dictionary, constructed according to 
algorithm P4. There are 3762 terms (discriminators) in this A5 dictionary. 
Since there is adequate space available in the present utility for 
handling this dictionary, no deletion of poor discriminators is necessary. 

The performance of the manual word form, manual word stem, manual 
thesaurus, and automatic A5 dictionaries is shown in Fig. 20. The A5 
dictionary performs better at all recall values than each of the manually 
constructed dictionaries. Thus, the conclusion that automatic negative
dictionary construction methods are effective is substantiated.

7. Summary and Conclusions

Negative dictionary construction algorithms are described in this
study that use manual, semi-automatic, and fully automatic methods to
determine common words. The intention is to show that dictionaries
constructed by fully automatic methods perform as well as or better
than dictionaries formed manually. Experimental evidence from two
document collections indicates that automatic negative dictionary
construction methods are more effective than standard manual negative
dictionary construction methods, and are therefore to be preferred.

In addition to the general results regarding negative dictionary 
construction, important results are found regarding the discrimination 
value function. The usefulness of discrimination value is verified both 
theoretically and experimentally, and its properties are more fully 
explored. The relationships of words in a collection are examined, in 
terms of frequency, distribution, and discrimination value. Finally, 
experimental results are given which show that dictionary size may be
greatly reduced by retaining only good discriminators, while maintaining
retrieval performance.

Dynamically versus Statically Obtained
Information Values
A. van der Meulen

Abstract 

An evaluation is made of the effectiveness of dynamically
updated parameters, known as "information values", and a comparison
is made with a statistical approach in which parameter values are
computed once rather than being continually changed.

Information values are quantities which may be assigned to
dictionary items in order to reflect the descriptive power of the
various keywords. The main objective of the usage of information
values is the improvement of system performance by taking into
account the usefulness of the index terms in content analysis.
Simultaneously a means of control over the index vocabulary is
obtained, since an information value displays the quality of
its associated keyword.

The procedure of assigning information values is based on
the philosophy that keywords which help to retrieve relevant
documents are more valuable than those which participate in
retrieving nonrelevant documents. In particular the utilization
of user relevancy decisions is examined in two different ways.
First, the collected data reflecting the retrieval history of a
system is used to compute information values in a rather statistical
way. Second, each retrieval result is used on its own to change and
update the current information values.



1. Introduction 

The investigations described in this report may be regarded as
an extension of earlier work done with information values [1]. It is
suggested there that a realistic dynamic updating be carried out rather
than the simplified simulation used earlier, which will be referred to as
the "static" updating approach. The process of using the retrieval
results of each previous search to improve the results of the next will
be called dynamic updating.

A known example of dynamic system updating is the "dynamic
document modification procedure" [2]. Imperfect document indexes can be
improved by utilizing the retrieval results as evaluated by the user
(feedback). After each search the indexes of the retrieved documents
are slightly changed in such a way that a new user who wants similar
information will find his relevant documents ranked higher and his
retrieved nonrelevant documents ranked lower. One might wonder whether
this process of continuous index modifications is stable in that finally,
after intensive system use, the system performance reaches an optimum;
this question has not been investigated yet for dynamic document space
modification.

Another type of dynamic updating is proposed for the so-called
information values, quantities which give a numeric value for the quality
(discrimination power) of the identifiers (keywords) used in the indexing
process. In previous experiments [1] with these values a statistical
approach is used for practical reasons. It differs from the dynamic
strategy in that the retrieval results are stored for a whole set
of queries from which in turn the information values are computed.
In the dynamic approach, however, the retrieval results for each
individual query are affected by the results obtained for the
previous one, since the updating occurs continuously (after each
search).

2. Information Values and their Derivation 
A. The Concept 

In regular relevance feedback, the user judges the retrieved
documents as either relevant or not relevant to his query. This
information is then fed back into the system and used to redefine
the query for subsequent search iterations. These relevance
judgments may also be used automatically to construct a weighted
dictionary, as is explained in the next section.

Specifically, with each identifier one can associate a
numerical value, herein referred to as the information value.
This information value is initially set equal to one, and is changed
only when the corresponding concept occurs both in the query and in
a retrieved document. The information value is then increased if
that retrieved document is declared relevant, or else it is
decreased. In the indexing process, these information values may
be used in addition to the existing identifier frequency weights;
that is, each vector-weight is formed as the product of both.
This will result in document and query vectors in which the prominent
descriptors are emphasized, while the lesser ones are suppressed.

The whole procedure is simple to implement. An attempt is thus
made to determine whether the process leads to an improvement in system
performance, that is, a lifting of the recall-precision curve.

B. The Updating Method

Initially, the information value of each concept is set equal
to one. From that point the value is increased (or decreased) each
time the corresponding concept co-occurs both in a query and the
relevant (or nonrelevant) documents retrieved in response to that query.

The specific increment-decrement function chosen is the one
proposed by Sage [3], herein referred to as the "sine-function" because
of its resemblance to the regular sine. If i is an identifier which
co-occurs both in a query and in its associated retrieved document, define

$v_i$ = the information value of identifier i
(initially set equal to one)

$v_i^*$ = the information value of identifier i
after updating

$x_i = \arcsin(v_i - 1)$, the transposed information value.

Then $v_i = 1 + \sin(x_i)$,
and similarly $v_i^*$ is calculated by

$$v_i^* = 1 + \sin(x_i \pm \Delta x_i) \qquad (1)$$

where $\Delta x_i$ is a function of the old information value, calculated as

Ax. = — 2 (2) 
where C is an arbitrary constant, set equal to 8 in these experiments.
$\Delta x_i$ is added in equation (1) if the retrieved document
containing term i is judged relevant by the user; or it is
subtracted in the information value calculation if the corresponding
document is judged nonrelevant.
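
Equation (1) can be illustrated with a short sketch. The step function of equation (2) is not legible in the available text, so the function `assumed_step` below is a purely hypothetical stand-in, chosen only so that the step depends on the old value, uses the constant C = 8, and shrinks as the value approaches its bounds. It should not be read as the function actually used in the experiments.

```python
import math

C = 8  # constant named in the text

def assumed_step(x, c=C):
    # HYPOTHETICAL stand-in for equation (2): the step shrinks as |x|
    # approaches pi/2, so the updated value stays between 0 and 2.
    return (math.pi / 2 - abs(x)) / c

def update_information_value(v, relevant, step=assumed_step):
    """Apply equation (1): v* = 1 + sin(x +/- dx), with x = arcsin(v - 1)."""
    x = math.asin(max(-1.0, min(1.0, v - 1.0)))  # transposed information value
    dx = step(x)
    x_new = x + dx if relevant else x - dx       # add if relevant, else subtract
    return 1.0 + math.sin(x_new)
```

Starting from the initial value of one, a relevant retrieval raises the value above one and a nonrelevant retrieval lowers it by the same amount; under the assumed step, repeated updates in one direction approach 2 (or 0) without passing it.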

C. Dynamic versus Static Updating 

The information values are increased as long as their
identifiers occur in relevant retrieved documents, but are decreased
in response to nonrelevant retrieval. Each new query is transformed
by the weighted dictionary in the following way:

$$q_i = w_i \cdot v_i \qquad (3)$$

where $q_i$ is the resulting query weight of identifier i, $w_i$ is the
regular frequency weight of identifier i, and $v_i$
the corresponding information value. Notice also that new incoming
documents will be indexed in a similar way, using the weighted dictionary.

An intuitive explanation for the existence of an equilibrium
value follows. Consider again identifier i and the corresponding
information value $v_i$. The similarity measure between the query
and the document vector is computed as the cosine of the angle between
them. As a result of each relevant retrieval $v_i$ is updated.
Consequently the term yields a greater influence on the direction
of the corresponding vector in the index-space. Since identifier i
is emphasized, more documents containing that identifier are
retrieved, since a larger correlation coefficient is obtained.

As long as this promoted identifier is successful in retrieving
relevant documents, the corresponding information value will be
increased. If, however, nonrelevant documents are retrieved because
of an excessively high value for a given concept, then the
self-correcting feedback mechanism comes into play and the information value
will be set back again.

The assumption is that in this way each identifier eventually
achieves an equilibrium value which is characteristic of its descriptive
power.
The mechanism described above defines a dynamic dictionary update
procedure, wherein the dictionary is modified after each query search.
In this situation each query is indexed only after the dictionary has
been altered in response to all previously searched queries.

The first experiment with information values uses labor-saving
batching techniques. The document collection is searched by a batch of
queries indexed with the unmodified dictionary. Sufficient information
is stored to allow updating the information values on the basis of the
two highest ranked retrieved documents associated with each query.

This procedure, however, may exaggerate the information values
obtained, since the self-correcting capacity of the feedback process
described earlier is never invoked. Specifically, a good or bad term
will be continually changed based on its retrieval effectiveness using
its initial information value, rather than its retrieval effectiveness
using its perhaps modified information value. Nevertheless, since a
relatively small number of updates normally occurs, it seems plausible
that these values may still give a good indication of the information
value of the associated identifiers.

It is interesting to draw a comparison between dynamic and 
static updating; the more realistic dynamic approach is simulated within 
the SMART system. Details of the dynamic experiments are de:;cribed in 
the next section. 
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3^ The Experiments 

Information values are changed according to an algorithm which
takes into account the co-occurrence of concepts in the retrieved
documents and the associated query. Information values are increased
if the retrieved documents are relevant and decreased otherwise.
The dynamic updating is done immediately after each query of the
query update collection has been searched. Updated information
values are used in indexing the next query, and the retrieval results
again modify the information values.
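
The dynamic procedure just described can be sketched in Python as follows. This is only an illustration: the multiplicative step (`step`), the default starting value of 1.0 and the `search` callback are assumptions made for the example, and the actual updating function is the one defined earlier in the report.

```python
def update_information_values(values, query_terms, doc_terms, relevant, step=0.1):
    """Return a copy of `values` updated from one retrieved document.

    Terms co-occurring in the query and the document are raised when the
    document is relevant and lowered otherwise. The multiplicative rule
    and `step` are illustrative assumptions, not the report's function.
    """
    factor = 1.0 + step if relevant else 1.0 - step
    updated = dict(values)
    for term in query_terms & doc_terms:
        updated[term] = updated.get(term, 1.0) * factor
    return updated


def run_dynamic(values, queries, search):
    """Process queries one at a time, updating after every search.

    `search(query_terms, values)` is a hypothetical callback returning
    (doc_terms, relevant) pairs for the top retrieved documents. Each
    query is searched with the values left by all earlier queries.
    """
    for query_terms in queries:
        for doc_terms, relevant in search(query_terms, values):
            values = update_information_values(
                values, query_terms, doc_terms, relevant
            )
    return values
```

In the static (batch) variant, by contrast, every query would be searched with the same unmodified values and the updates applied only afterwards.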

For the purpose of comparing the static and the dynamic
approaches, two dynamic experiments are carried out as follows:

a) Information values are dynamically obtained using 200
queries belonging to the Cranfield 1400 document
collection. The information values obtained are then
used to reindex a test collection of 25 queries. This
query test collection is run as a batch, and the retrieval
results obtained will therefore be called semi-dynamic.

b) The fully dynamic approach is used, in which the 25
queries of the test collection are also run dynamically
in the same way as the queries of the update collection.

4. The Results

Before discussing the comparative results of the dynamic
versus the static approach, the latter will be examined first.
Results obtained earlier with the static approach were not
satisfactory [1]. After similar experiments were done with
so-called utility values [4], using the same updating function,
the reasons for the failure became clearer. They are the following:

a) The obviously wrong choice of a factor in the updating
algorithm, which produces increment and decrement steps
that are too large.

b) The use of nonoptimally chosen initial (starting)
information values. Instead of initializing all information
values equal to one (reducing equation (3) to only
frequency weights) one could have taken advantage of the
available discrimination values for this collection. By
using the discrimination values as starting values for
the dynamic information values, a better system performance
can immediately be obtained. In this case it will be
necessary to scale the discrimination values in such a way
that the average equals 1 (see c).

c) No provision is made for balancing. Balancing means keeping 
the average information value equal to 1 in order not to 
promote or suppress terms not yet updated. In the experiments 
done with utility values this is shown to be a critical factor.

d) The query test collection was not chosen randomly, and after
the earlier experiments [1] were done, it was found that
the test collection was related to one specific subset of
the document collection. This means that keywords specific
to that particular field are not likely to be updated by
the updating query collection, which concerns other topics.
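
The balancing step named in points b) and c) amounts to rescaling the values so that their average equals 1. A minimal Python sketch, assuming the values are held in a dictionary keyed by term (the function name and data layout are illustrative only):

```python
def balance(values):
    """Rescale term values so that their average equals 1.

    Can be applied to discrimination values used as starting values,
    or after an update so that terms not yet updated are neither
    promoted nor suppressed relative to the rest.
    """
    mean = sum(values.values()) / len(values)
    return {term: value / mean for term, value in values.items()}
```

Terms that were not touched by an update keep their relative standing, because every value is divided by the same mean.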

Since the same defects necessarily apply to the dynamic strategy,
one may not expect an absolute improvement of the system performance;
rather, a relative improvement over the statically obtained
information values is anticipated.



The results of the present experiment A (Fig. 1) indicate
that running the test collection in a batch using information
values dynamically obtained from the updating collection gives
better results than when the static values are used. The
improvement is slight, though, because the test collection has
unfortunately not been chosen randomly. Since the query test collection
is directed to a specific field and thus utilizes keywords describing
this field, the associated information values are not likely to have
been updated by the update collection in which they did not occur.

Much more interesting, therefore, is experiment B, where the
query test collection itself is used in a dynamic way (Fig. 2).
Here the concepts of each subsequent query are updated by the previous
one.

If the dynamic approach really works as it should, that is, 
if the good properties of the feedback mechanism mentioned earlier are 
utilized, this experiment should show clearly the dynamic effects. 
Fig. 2 shows indeed that system performance is improved considerably 
compared with both the static and the semi-dynamic case.

5. Conclusion

For a given collection and a given updating algorithm the
static and dynamic updating strategies have been compared. The static
information values were obtained in an earlier project and did not
perform very well, partly because of defects in the updating algorithm
discovered later. In order to draw a fair comparison between statically
and dynamically obtained information values all conditions are kept the
same.

The results show that in the two dynamic experiments carried out,
dynamic updating is superior to static updating. The first experiment
is a mixture of dynamic and static updating in that the test collection is
processed as a batch; the second experiment is fully dynamic and the
results of this experiment show in particular a considerable improvement
over the static strategy.
Automatic Thesaurus Construction Through The
Use of Pre-Defined Relevance Judgments
Kenneth Welles

Abstract 

A method for totally automatic construction of thesauri
is proposed which relies upon accumulation of match-mismatch
influences to converge. Results so far, while not conclusive,
are promising.

1. Introduction

It is now an accepted fact that the use of term classification
thesauri improves the operation of information storage
and retrieval systems. There are many different approaches to
the construction of such a thesaurus, ranging from completely
by hand to totally automatic. This paper deals with a proposed
system of creating a thesaurus totally automatically.

Automatic term classification algorithms have mainly
centered on two areas: term co-occurrence as an indication of
synonymy, and predetermined relevance judgments as a source of
information about synonymy. My work deals with the latter
aspect. The basis for such a system is a set of documents, a
set of requests, and a set of judgments as to which documents
should and should not be retrieved by each request. From this
information, the machine can construct a set of classifications of terms
(synonym classes) which improve the retrieval effectiveness of the
given queries substantially. The rationale behind such a construction
is this: if the set of requests was comprehensive, then any similar
requests presented in the future will also benefit from the resulting
thesaurus.

The starting point in this area is a paper by Jackson on the
construction of precisely such a system [1]. A program was written to
the specifications of Jackson's paper. Early in the project it was
discovered that a minimal collection (about 80% of the ADI collection
of documents and queries) requires more than 30 minutes to run on the
IBM 360/65 without completing the first of several iterations. This is
obviously an impractical program to implement.

Examination of the program shows that a substantial portion of
the run time is spent on constructing, maintaining, and observing what
Jackson calls "degeneracy conditions", which assure convergence. A
different approach is proposed in this paper. In order to decide which
terms are to be considered synonymous, each pair of terms is examined.
If these two terms are considered synonymous, then any query-document
pair where one term is in the query and the other term is in the
document will have this term pair counted as a match. This increases
the number of matching terms in this query-document pair, and so
increases the calculated matching coefficient for this query-document
pair. If the query-document pair is defined as relevant, then this
increase in matching is good. If the query-document pair is defined as
not relevant, then this increase is bad (for the purposes of constructing
a thesaurus which causes the calculated matches to agree with the
defined relevancies). If the amount of good outweighs the bad,
the pair is considered synonymous, and when all such synonym pairs
have been considered, the query-document set is re-examined for
the next iteration. Hopefully, after several iterations, the set
of synonym pairs will stabilize in a way that gives the calculated
relevance of the query-document set the closest resemblance to the
defined relevance set.

2. Terminology and Definitions

Before continuing with the technical aspects of the proposed
program, some definitions are necessary. There are four possible
term classes, and for any given query, document and term, the term
falls into one of these classes (see figure 1). If the term is not
present in either the query or the document, it is in class A. If the
term is in the query, but not in the document, it is in class B. If
the term is in the document but not in the query, it is in class C.
Finally, if the term is in both query and document, then the term is
in class D.
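
The four term classes can be stated compactly in Python. The sketch below assumes that a query and a document are each represented as a set of terms (binary weighting); the function name is illustrative.

```python
def term_class(term, query, document):
    """Classify a term as A, B, C or D for one query-document pair."""
    in_query = term in query
    in_document = term in document
    if in_query and in_document:
        return "D"  # present in both
    if in_query:
        return "B"  # query only
    if in_document:
        return "C"  # document only
    return "A"      # present in neither
```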

All correlations between queries and documents are calculated
during each program iteration (since the correlations change from one
iteration to the next). For each query, the correlation values of the
defined relevant and defined not relevant documents are considered.
The lowest correlation value of a defined relevant document and the
highest correlation value of a defined not relevant document are
determined. A value midway between these values is taken as the
match value (see figure 2). If any query-document pair exhibits a
correlation higher than this match value, then this pair is said to
be calculated relevant; if the correlation is less than or equal to
this match value, the pair is said to be calculated not relevant.

Each query-document pair falls into one of four classes (see
figure 3). If the pair is calculated not relevant and defined not
relevant (by the initial relevance data), then it is in class R. If
the pair is calculated relevant, but defined not relevant, it is in
class S. If the pair is calculated not relevant, but defined relevant,
it is in class T. If the pair is defined and calculated relevant, it
is in class U.
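
The match value and the four pair classes can be sketched in Python as follows. The sketch assumes that every query has at least one defined relevant and one defined not relevant document; the function names and the list-based layout are illustrative.

```python
def match_value(correlations, defined_relevant):
    """Midpoint between the lowest relevant and highest non-relevant correlation.

    `correlations[i]` and `defined_relevant[i]` describe document i for
    one query. Assumes both kinds of document occur at least once.
    """
    pairs = list(zip(correlations, defined_relevant))
    lowest_relevant = min(c for c, rel in pairs if rel)
    highest_irrelevant = max(c for c, rel in pairs if not rel)
    return (lowest_relevant + highest_irrelevant) / 2


def pair_class(correlation, match, defined_relevant):
    """Return R, S, T or U for one query-document pair."""
    calculated_relevant = correlation > match  # "higher than" the match value
    if calculated_relevant:
        return "U" if defined_relevant else "S"
    return "T" if defined_relevant else "R"
```

When the lowest relevant correlation already exceeds the highest non-relevant one, every pair of that query falls into class R or U.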

If all the query-document pairs are either class R or class U,
then the calculated matches all correspond to the defined matches and
no further modifications are needed. However, if there exist
query-document pairs in classes S or T, the term space must be modified to
cause calculations to agree with definitions.

The cosine correlation is used to calculate the match between a
query and document. If the possibility of synonyms is ignored, this
value becomes:

\[ \frac{D}{B + C + D} \]

where B, C and D are the numbers of terms in classes B, C and D
respectively. However, if any term in class B is a synonym of any term in
class C, then this pair of terms is considered a match, and is counted as
a class D term instead of a class B term. All term pairs with one class B
term and one class C term are considered "potential synonym pairs" for this
particular query-document pair. It is seen here that if many of these
"potential synonym pairs" are, indeed, considered synonymous, then
the correlation between query and document can be raised considerably.
It should also be noted that any synonym pair may cause an increase
in many different query-document correlations, some desirable, and
some undesirable.
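
A small Python sketch of this matching function follows. It takes one possible reading of the rule above: each class B term that has a synonym among the class C terms is moved from class B to class D, and class C is left unchanged. Synonym pairs are assumed to be stored as unordered `frozenset` pairs; all names are illustrative.

```python
def correlation(query, document, synonyms=frozenset()):
    """D / (B + C + D), promoting class B terms matched through a synonym.

    `query` and `document` are sets of terms; `synonyms` is a set of
    frozenset({term1, term2}) pairs currently considered synonymous.
    """
    class_b = query - document
    class_c = document - query
    d = len(query & document)
    # A class B term counts as a match if any class C term is its synonym.
    promoted = sum(
        1
        for b_term in class_b
        if any(frozenset((b_term, c_term)) in synonyms for c_term in class_c)
    )
    total = len(class_b) + len(class_c) + d
    return (d + promoted) / total if total else 0.0
```

With query {t1, t2, t3} and document {t3, t4, t5}, the value is 1/5 without synonyms and 2/5 once t1 and t4 are declared synonymous.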

As an example, take the query and two documents in table 1(a).
In table 1(b) we see that both possible query-document pairs have only
one exact term match, and each also has four potential synonym pairs.
In table 1(c) we see the effect on the cosine correlation of these
query-document pairs if different potential synonym pairs are
considered synonymous. If no synonyms are considered, both query-document
pairs have the same correlation. However, if RED and GREEN or BALL and
BAT are considered synonymous, then the correlation of query-document
pair A is raised but that of query-document pair B is not. If BALL
and GREEN are considered synonymous, the correlations of both
query-document pairs are raised. Also, if BLUE-CHILD is synonymous, then
the correlation of query-document pair B is raised, and that of
query-document pair A is not.

We can now see that proper choice of synonyms allows a great
deal of manipulation of matching coefficients. If, for instance, we
had defined query-document pair A as relevant, and query-document pair
B as not relevant, then RED-GREEN or BAT-BALL would be good potential
synonym pairs to consider synonymous, while BALL-GREEN or BALL-CHILD
would not.
3. Pseudo-Classification Procedure

A term matrix is set up which has a numerical value for each
term pair. If this value exceeds a user-defined "threshold of
synonymy", then the corresponding pair of terms is considered
synonymous (for use in calculating the matching function). The term
matrix is initially set to all zeroes (no synonyms).

At the start of each iteration, all query-document cosine 
correlations are calculated and stored. The correlations are calculated 
not only with direct term matches, but also counting any pair of terms 
which is a "potential synonym pair" for this query and document, and 
which is defined synonymous by the term matrix. Since the entries in
the term matrix vary from one iteration to the next, the calculated 
correlations will also vary. 

After all the correlations have been stored, each query-document
pair is again considered in turn, and the previously stored
correlation value and the query-document class (R, S, T or U) to
which this pair corresponds are used to calculate a number. This
number, which may be positive or negative, is the "term modifier"
and is added to each entry in the term matrix which corresponds to
a "potential synonym pair" for this particular query-document pair.

After this calculation, the number of query-document pairs
in classes S and T is counted and output as an indication of the 
degree of convergence to the desired state. The modified term matrix 
is then utilized in the next iteration for calculation of the matching 
function. Iteration continues until the program converges (all 
query-document pairs are in class R or U) or operator intervention 
occurs. 
The heart of the program is the action of the term modifier.
Any one term pair may be a "potential synonym pair" for many
different query-document pairs. Thus, after each iteration, the
corresponding term matrix entry will be changed by an amount equal
to the sum of the term modifiers for all query-document pairs which
include this term pair as a "potential synonym pair."
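
The accumulation of term modifiers into the term matrix, and the later thresholding, can be sketched in Python as below. The term matrix is assumed here to be a dictionary keyed by unordered term pairs rather than a two-dimensional array; the function names are illustrative.

```python
from collections import defaultdict
from itertools import product


def accumulate(term_matrix, query, document, modifier):
    """Add `modifier` to every potential synonym pair of one query-document pair.

    A potential synonym pair joins a query-only (class B) term with a
    document-only (class C) term.
    """
    for b_term, c_term in product(query - document, document - query):
        term_matrix[frozenset((b_term, c_term))] += modifier


def synonyms_from(term_matrix, threshold):
    """Return the pairs whose accumulated value exceeds the threshold of synonymy."""
    return {pair for pair, value in term_matrix.items() if value > threshold}
```

A typical use would start from `term_matrix = defaultdict(int)`, call `accumulate` once per query-document pair, and then derive the synonym set for the next iteration with `synonyms_from`.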

If this set of query-document pairs consists entirely of class 
T pairs, then one would wish to raise the correlation of the pairs so 
that they might become class U pairs. If the modifier is a positive 
number for a class T query-document pair, then the net effect of many 
class T pairs will be to raise the term matrix entry above the 
threshold of synonymy. This, in turn, would increase the correlation
of the query-document pairs, which is the desired result.

Conversely, if we are presented with class S query-document
pairs, we wish to lower the correlation by causing synonym pairs
which contribute to the correlation to become "potential synonym
pairs" below the "threshold of synonymy," i.e., to cause them to
no longer be synonymous. If the term modifier is negative for class
S query-document pairs, it will have the desired effect.

Query-document pairs of the S or T classes are called 
mismatched pairs because the calculated results do not agree with 
the defined relevancies. The degree of mismatch is the amount that 
the calculated correlation differs from the match value. If the 
degree of mismatch is large, then many "potential synonym pairs" which 
are (or are not) synonyms must be modified until they are not (or are) 
synonyms. If a term modifier is made large in magnitude, then it 
will have a greater effect on the term matrix than other smaller term
modifiers. This causes (on the average) a greater amount of change 
in the number of "potential synonym pairs" which actually change 
status to or from synonymy. It is therefore desirable that the term 
modifier should vary in magnitude with the degree of mismatch in 
classes S and T. 

At first it would seem that, because query-document pairs in
classes R and U need not be changed, the corresponding term modifiers
should be zero. This was tried and found to cause oscillation and
prevent convergence. The reason is that the correctly matched pairs
do nothing to maintain their status quo. When the mismatched
query-document pairs modify the term space to correct their own mismatch,
they disturb the balance of synonymy in correctly matched pairs.
To prevent the resultant oscillation of query-document pairs between
classes R and S, and classes T and U, it is necessary to give term
modifiers of classes R and U small negative and positive values
respectively. The term modifier of class R should be proportional
to minus the number of pairs in classes T and U. The term modifier
of class U should be proportional to the number of pairs in classes
R and S. This assures stability of solution in the fully or nearly
fully convergent case (almost all pairs in class R or U).

To assure continuity of the term modifier in class S, the
term modifier is equal to the term modifier of class R (a constant)
minus the absolute value of the mismatch of the class S
query-document pair. Similarly, the class T term modifier is equal to the
value of the class U term modifier plus the absolute value of the
mismatch.
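
The four cases of the term modifier can be summarized in a short Python sketch. Here `a` and `b` stand for the magnitudes of the class R and class U modifiers; how they are chosen (fixed constants, or values proportional to the class counts as suggested above) is left to the caller, and the function name is illustrative.

```python
def term_modifier(pair_class, correlation, match, a, b):
    """Term modifier for one query-document pair.

    R: -a              (small negative, keeps correct non-matches stable)
    S: -a - mismatch   (pushes wrongly matched pairs down)
    T:  b + mismatch   (pushes wrongly unmatched pairs up)
    U:  b              (small positive, keeps correct matches stable)
    """
    mismatch = abs(correlation - match)
    if pair_class == "R":
        return -a
    if pair_class == "S":
        return -a - mismatch
    if pair_class == "T":
        return b + mismatch
    if pair_class == "U":
        return b
    raise ValueError("pair_class must be one of R, S, T, U")
```

As the mismatch of a class S or class T pair shrinks to zero, its modifier approaches the class R or class U value, which is the continuity property described above.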
4. Programming System

A sample program as implemented in FORTRAN IV is shown 
in the appendix as an example of this algorithm. All arithmetic
is performed with integer and binary variables and arrays. Binary 
variables are treated as integers with value 0 or 1. 

In section A, the document vectors, query vectors, and 
relevance judgments are read in. This is the main data. The 
term-term matrix and synonym matrix are zeroed, and initial values
for all other variables are set up. 

In section B, the correlation coefficient is calculated, 
taking into account (statements 500 and on) any synonym term 
matches as well as direct term matches. The calculated correlation
of query (J) and document (I) is stored in correlation (J, I). 

In section C, the dynamic matching threshold is calculated 
for each query, and stored in matchvalue (J). 

In section D, each query-document pair is considered in turn.
The "term-modifier" (variable name is MODIFIER) is calculated from 
the relevance judgment (relevant (J, I)) and calculated correlation 
(correlation (J, I)) of the pair, and the dynamic match value 
(matchvalue (J)) of this query. 

In section E, all term pairs are examined, and the term-term matrix
entries (termterm (K1,K2)) of all pairs which are "potential synonym
pairs" for this query-document pair are modified by the "term modifier".

After sections D and E have been completed for all query-document
pairs, the modified term-term matrix (termterm (K1,K2)) is
examined in section F. From the data in this matrix, the synonym
C*****THIS PROGRAM CONSTRUCTS A THESAURUS OF TERMS AUTOMATICALLY.
C*****A SET OF "L" DIFFERENT DOCUMENTS AND "M" DIFFERENT QUERIES
C*****ARE USED AS DATA. EACH DOCUMENT OR QUERY MAY HAVE
C*****ANY OR ALL OF "N" DIFFERENT TERMS PRESENT (BINARY WEIGHTING).
C*****DATA FOR DECISIONS IS GIVEN IN THE BINARY MATRIX OF
C*****DIMENSION (L*N) CALLED DOCUMENT; THE BINARY MATRIX OF
C*****DIMENSION (M*N) CALLED QUERY; THE BINARY MATRIX OF
C*****DIMENSION (M*L) CALLED RELEVANT. THIS LAST MATRIX IS THE
C*****OPERATORS DEFINITION OF WHAT DOCUMENTS ARE RELEVANT TO WHICH
C*****QUERIES. DEFINED RELEVANCE IS INDICATED BY A "1" AND DEFINED
C*****IRRELEVANCE IS INDICATED BY A "0".

      IMPLICIT INTEGER(A-Z)
      BINARY DOCUMENT(L,N), QUERY(M,N), RELEVANT(M,L)
      BINARY SYNONYM(N,N)
      INTEGER TERMTERM(N,N), CORRELATION(M,L), MATCHVALUE(M)



C******************** SECTION A ********************************
C*****INPUT DATA TO BASE THE THESAURUS ON
      READ ((DOCUMENT(I,K),K=1,N),I=1,L)
      READ ((QUERY(J,K),K=1,N),J=1,M)
      READ ((RELEVANT(J,I),I=1,L),J=1,M)
C*****ZERO OUT THE SYNONYM AND TERM-TERM MATRICES
      DO 100 K1=1,N
      DO 100 K2=1,N
      TERMTERM(K1,K2) = 0
      SYNONYM(K1,K2) = 0
  100 CONTINUE
      ITERATION=0
      MAXITERATION=20
      A=5
9=fi 

      SYNTHRESHOLD=20
      CRITERION=L*M/20

C******************** SECTION B ********************************
C*****FOR EACH QUERY-DOCUMENT PAIR, CALCULATE THE CORRELATION
C*****COEFFICIENT AND STORE IT
  200 DO 1000 J=1,M
      DO 1000 I=1,L
      MATCH=0
      LENGTH=0
      DO 900 K1=1,N
      CONDITION=1+DOCUMENT(I,K1)+2*QUERY(J,K1)
      GO TO (400,500,700,800),CONDITION
C*****TERM IS NOT IN QUERY OR DOCUMENT
  400 GO TO 900
C*****TERM IS IN BOTH QUERY AND DOCUMENT
  800 MATCH=MATCH+1
      LENGTH=LENGTH+1
      GO TO 900



C*****WHEN CALCULATING THE CORRELATION, TAKE INTO ACCOUNT ALL
C*****THE SYNONYMS WHICH ARE INDIRECT TERM MATCHES.
C*****TERM IS IN DOCUMENT, BUT NOT IN QUERY
  500 LENGTH=LENGTH+1
      DO 600 K2=1,N
      IF (QUERY(J,K2) .EQ. 0) GO TO 600
      IF (SYNONYM(K1,K2) .EQ. 0) GO TO 600
      IF (DOCUMENT(I,K2) .EQ. 1) GO TO 600
      MATCH=MATCH+1
      GO TO 900
  600 CONTINUE
      GO TO 900
C*****TERM IS IN QUERY, BUT NOT IN DOCUMENT
  700 LENGTH=LENGTH+1
      GO TO 900
  900 CONTINUE
C*****NORMALIZE THE CORRELATIONS TO A 0-100 SCALE
      CORRELATION(J,I)=(MATCH*100)/LENGTH
 1000 CONTINUE
C*****ALL CORRELATIONS OF Q-D PAIRS HAVE BEEN CALCULATED AT THIS
C*****POINT.

C******************** SECTION C ********************************
C*****FOR EACH QUERY...
      DO 1300 J=1,M
      HIGHIRRELEVANT=0
      LOWRELEVANT=100
      DO 1200 I=1,L
      IF (RELEVANT(J,I) .EQ. 1) GO TO 1100
C*****FIND THE DEFINED IRRELEVANT DOCUMENT WITH THE
C*****HIGHEST CORRELATION COEFFICIENT.
      HIGHIRRELEVANT=MAX(HIGHIRRELEVANT,CORRELATION(J,I))
      GO TO 1200
C*****AND FIND THE DEFINED RELEVANT DOCUMENT WITH
C*****THE LOWEST CORRELATION COEFFICIENT.
 1100 LOWRELEVANT=MIN(LOWRELEVANT,CORRELATION(J,I))
 1200 CONTINUE
C*****AND SET THE MATCH VALUE LEVEL MIDWAY BETWEEN THESE VALUES.
      MATCHVALUE(J)=(LOWRELEVANT+HIGHIRRELEVANT)/2
 1300 CONTINUE



C******************** SECTION D ********************************
      RCASES=0
      SCASES=0
      TCASES=0
      UCASES=0
      DO 2000 J=1,M
      DO 2000 I=1,L
      IF (CORRELATION(J,I) .GE. MATCHVALUE(J)) GO TO 1500
      IF (RELEVANT(J,I) .EQ. 1) GO TO 1400
C*****THIS DOCUMENT IS CALCULATED IRRELEVANT, AND
C*****DEFINED IRRELEVANT, IT IS A CASE "R"
      MODIFIER=-A
      RCASES=RCASES+1
      GO TO 1700
C*****THIS DOCUMENT IS CALCULATED IRRELEVANT, AND
C*****DEFINED RELEVANT, IT IS A CASE "T"
 1400 MODIFIER=ABS(CORRELATION(J,I)-MATCHVALUE(J))+B
      TCASES=TCASES+1
      GO TO 1700
 1500 IF (RELEVANT(J,I) .EQ. 1) GO TO 1600
C*****THIS DOCUMENT IS CALCULATED RELEVANT, AND
C*****DEFINED IRRELEVANT, IT IS A CASE "S"
      MODIFIER=-ABS(CORRELATION(J,I)-MATCHVALUE(J))-A
      SCASES=SCASES+1
      GO TO 1700
C*****THIS DOCUMENT IS CALCULATED RELEVANT, AND
C*****DEFINED RELEVANT, IT IS A CASE "U"
 1600 MODIFIER=B
      UCASES=UCASES+1
 1700 CONTINUE

C******************** SECTION E ********************************
C*****AT THIS POINT, THE MODIFIER HAS BEEN DEFINED ACCORDING TO
C*****WHAT CASE THE QUERY DOCUMENT PAIR IS AT THIS ITERATION.
      DO 1900 K1=1,N
      IF (DOCUMENT(I,K1) .GE. QUERY(J,K1)) GO TO 1900
      DO 1800 K2=1,N
      IF (DOCUMENT(I,K2) .LE. QUERY(J,K2)) GO TO 1800
C*****AT THIS POINT, TERM K1, AND TERM K2 ARE A POTENTIAL
C*****SYNONYM PAIR FOR THIS QUERY-DOCUMENT PAIR,
C*****SO THE CORRESPONDING ENTRY IS MODIFIED.
      TERMTERM(K1,K2)=TERMTERM(K1,K2)+MODIFIER
      TERMTERM(K2,K1)=TERMTERM(K1,K2)
 1800 CONTINUE
 1900 CONTINUE
 2000 CONTINUE

C******************** SECTION F ********************************
C*****NOW THE SYNONYM MATRIX IS REDEFINED BY
C*****WHICH TERM-TERM ENTRIES ARE NOW ABOVE THE THRESHOLD
      DO 2200 K1=1,N
      DO 2200 K2=1,N
      IF (TERMTERM(K1,K2) .GT. SYNTHRESHOLD) GO TO 2100
      SYNONYM(K1,K2)=0
      GO TO 2200
 2100 SYNONYM(K1,K2)=1
 2200 CONTINUE

C******************** SECTION G ********************************
C*****FIND OUT IF THE PROGRAM HAS CONVERGED OR REACHED
C*****THE MAXIMUM NUMBER OF ITERATIONS. IF NOT, LOOP!
      ITERATION=ITERATION+1
      WRITE ITERATION,RCASES,SCASES,TCASES,UCASES
      TOTALMISMATCH=SCASES+TCASES
      IF (TOTALMISMATCH .LE. CRITERION) GO TO 2300
      IF (ITERATION .LE. MAXITERATION) GO TO 200
 2300 WRITE ((SYNONYM(N1,N2),N1=1,N),N2=1,N)
      STOP
      END
matrix is changed to reflect the updated state of calculated synonymy.
In section G, program status is printed out and a decision is
made whether to iterate (back to section B) or print out and halt.

5. Implementation

A program essentially identical to this, but modified for a
DEC PDP-11/20, was run for small and large collections of data. For
small data sets, convergence was reached in two to five iterations.
The larger data set was about 80% of the ADI collection, and while
results are promising, convergence on this data set has not yet been
attained.
Content Analysis and Relevance Feedback
A. Wong, R. Peck, and A. van der Meulen

Abstract

Content analysis is a vital part of any automatic document
retrieval system where natural language has to be analyzed in order to
detect the information carrying parts. The assignment of appropriate
identifiers to the documents and queries (the "indexing process")
can be carried out on different levels of complexity which generally
agree with different levels of system performance.

A device for indexing improvement of a quite different nature
is the "user feedback" technique which can be applied in an interactive
retrieval system. The initial indexing which is a result of a rather
imperfect content analysis can be corrected and improved by using the
judgment of the user concerning the relevancy or non-relevancy of his
retrieved documents.

This report deals with the question: how critical is the quality
of the content (language) analysis which results in the initial indexing
in an interactive retrieval system? Since in such a system feedback
techniques will improve system performance substantially, one could doubt
whether original differences in content analysis will still affect the final
performance. If such differences in indexing refinement turn out to be
retained after the feedback is applied, every improvement in initial
indexing should be put into practice; in addition a good justification
exists for working in the area of automatic dictionary construction.
1. Introduction

The performance of a document retrieval system depends heavily
on the transformation of the natural language of the document into an
artificial retrieval language. The document description in such a
retrieval language results in a much shorter representation compared
with the original one, and it is this indexing process which determines
how effectively the document can be retrieved.

If one defines an "ideal indexing process" as a process which
results in retrieval of only relevant documents in response to
queries, it is likely that, even with the most refined techniques,
ideal indexing does not exist. The two reasons are:

- the transformation of the author's ideas into a written text 
may be imperfect; moreover, during indexing usually only 
title and abstract are considered. 

- the existing language analysis tools are imperfect. 

The first reason in particular will always limit the quality of the
indexing process even if ideal language analysis devices were available.

When dealing with large collections of documents, the automation
of text analysis becomes a necessity, since manual indexing may not
then be a realistic alternative. Evaluation tests for a long period
have shown that manual indexing was superior to automatic indexing.
Nowadays the roles tend to reverse. This is mainly due to the fact
that in modern interactive retrieval systems the implementation of
user feedback strategies (index corrective mechanisms) yields a
considerable improvement in system performance. To be sure, those
interactive strategies are in principle also applicable to computerized
retrieval systems supplied with manually indexed documents; but usually
such a system organization does not allow the implementation of an
effective feedback algorithm.

Concentrating first on the automatic content analysis tools, 
one may distinguish two basically different approaches: 

a) Non-linguistic computer techniques.
Examples are:

- the automatic stemming of words followed by the counting of 
their frequency of occurrence; 

- procedures for the automatic detection of common words; 

- statistical procedures for the automatic construction of 
dictionaries; and 

- the automatic creation of weighted dictionaries, where the
weight reflects the descriptive power of an identifier.

b) Computer oriented linguistic techniques. 

In those procedures the syntactical meanings of sentences are 
taken into account rather than mere keywords; also phrases may 
be recognized in an explicit way. 

Such a syntactical analysis was supposed to fill the gap
between a pure mechanical analysis (category a) and an intellectual
manual one. To date, however, the situation is such that application
of the existing linguistic techniques degrades
system performance, rather than providing a substantial improvement.
In this report therefore only statistical language analysis
will be considered (category a).



The retrieval results obtained with the available automatic indexing 
devices are far from satisfactory, and one might doubt if it ever will 
become possible to improve the initial indexing sufficiently. 

To simplify the indexing problem, user feedback techniques may be
used, based on the premise that index corrections can be made dynamically
during the course of the search by utilizing the judgment of the user
concerning the relevancy of his retrieved documents [1]. One may refer to
"relevance feedback" and "document modification" as corrective indexing
techniques which allow the reindexing of a query (relevance feedback)
as well as the reindexing of documents (document modification). It should be
mentioned here that those corrected indexes are not static entities, but
dynamic, in tune with actions performed by the system users.

Relevance feedback in particular has proved to be a powerful technique 
which increases system performance significantly, since queries are in 
general short and poorly formulated. One must ask then how imperfect the 
initial indexing may be while still yielding results after feedback applica- 
tion which are comparable to those of the more refined initial indexing; 
in other words is the final feedback result a function of the initial indexing? 

If final results appear to be independent of initial indexing, 
most efforts concerning content analysis and in particular dictionary 
construction might be less meaningful than they originally appear to be. 
If those results show some improvement due to better initial indexing, what 
compromise must be made between indexing expedience and returns in terms 
of system performance after feedback is applied? But if the final results 
are highly dependent on the quality of initial indexing, the answer is clear: 
all efforts directed towards the improvement of content analysis remain of 
particular importance. 

2. Experiments 

The purpose of the experiments is to investigate the influence 
of language analysis tools on final system performance after a sufficient 
number of feedback iterations have been executed. "Sufficient" here means 
that a new iteration will not further affect the system performance 
obtained. In the SMART environment two or at most three iterations are 
normally sufficient. The experiments are likely to be dependent on the type 
of feedback algorithm used, which will be Rocchio's in all cases. 

A) Used Language Analysis Tools 

The main analysis tools provided by SMART include three dictionaries: 

- the word-form dictionary, 

- the word-stem dictionary, and 

- the thesaurus. 

The word-stem dictionary (suffix deletion) is a refinement of the word-form 
dictionary (deletion of plural "s" endings), and the thesaurus (grouping of 
related terms) is a refinement of the word-stem dictionary. 

An improvement which can be applied to each of the existing dictionaries 
is the application of the so-called "discrimination values" [2], that is, 
of quantities which reflect the descriptive power of the dictionary items 
(terms). More specifically, the discrimination value of a term is a measure 
of the change in the average correlation of the documents of a collection 
with its center of mass (centroid), measured first with the term used as an 
index term and again with that term deleted. If the collection moves closer 
together when term i is deleted, that is, if the average correlation with 
the centroid increases, that term is valuable in distinguishing individual 
documents. 
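
The definition above can be illustrated with a small sketch. It is only an illustration under stated assumptions, not the program used in the experiments: documents are represented as Python dictionaries mapping a term to its weight, the cosine function is taken as the correlation measure, and all function names (`cosine`, `centroid`, `avg_centroid_correlation`, `discrimination_value`) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine correlation between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(docs):
    """Center of mass of the collection: the average of all document vectors."""
    center = {}
    for doc in docs:
        for term, weight in doc.items():
            center[term] = center.get(term, 0.0) + weight / len(docs)
    return center


def avg_centroid_correlation(docs):
    """Average correlation of the documents with the collection centroid."""
    center = centroid(docs)
    return sum(cosine(doc, center) for doc in docs) / len(docs)


def discrimination_value(docs, term):
    """Change in average centroid correlation when `term` is deleted.

    A positive value means the collection moves closer together without
    the term, so the term helps to tell individual documents apart.
    """
    without = [{t: w for t, w in doc.items() if t != term} for doc in docs]
    return avg_centroid_correlation(without) - avg_centroid_correlation(docs)


# Hypothetical three-document collection: "the" occurs everywhere,
# the other terms occur in one document each.
docs = [
    {"the": 1.0, "a1": 1.0},
    {"the": 1.0, "b1": 1.0},
    {"the": 1.0, "c1": 1.0},
]
print(discrimination_value(docs, "the"), discrimination_value(docs, "a1"))
```

In this toy collection the term occurring in every document receives a negative value, while a term occurring in a single document receives a positive one.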

The application of discrimination values can be carried out in 
two possible ways: 

- deletion of bad discriminators, a process which however will 
not be considered in this report, and 

- creation of a weighted thesaurus [3]. 

B) Comparisons 

Two collections are considered in order to justify generalizing 
the results obtained in these experiments. They are: 

- the TIME collection consisting of 425 documents in the 
political science field provided with 83 queries, and 

- the ADI collection consisting of 82 documents in documentation 
provided with 32 queries. 

Two main experiments are carried out: first a comparison is made 
between four different dictionaries using the TIME collection. Compared 
are the system performances using a word-form dictionary and a thesaurus 
both with and without discrimination value application. System 
performances before and after two feedback iterations are considered. 
Second a comparison of a word-stem dictionary and a thesaurus using 
the ADI collection is made. System performances before and after three 
feedback iterations are considered. 

In the comparison of feedback results no attempt is made to go 
into complex evaluation schemes [4,5]. The system performances are 
expressed in simple recall-precision curves, which are suitable for the 
outlined purposes. 

3. Experimental Results 

The results for both collections (Figs. 1, 2, 3, and 4) clearly 
demonstrate that in all the investigated cases differences in initial 
performance are retained in the final precision-recall curves. In 
Figures 1, 2, and 3, the word-form dictionary serves as the reference curve 
and is compared with the weighted word-form dictionary, the thesaurus, 
and the weighted thesaurus. In Fig. 4 a word-stem dictionary is compared 
with a thesaurus. 

From the recall-precision curves one may draw the remarkable 
conclusion that the shape of each curve, reflecting the performance of a 
particular dictionary, remains invariant after the feedback operations, 
although the position of the curves is lifted. The relative ordering of 
the results for the various dictionaries also remains invariant. 

In the case of the TIME collection a better initial performance 
curve tends to be lifted relatively more (Figs. 1, 2, and 3); this 
results in a spreading out of the final curves. The ADI collection shows a 
slight "wash-out" effect in that initial differences are diminished. It 
has to be noted that the initial word-stem dictionary performance is 
fairly poor (low recall-precision curve), which explains the wash-out 
effect. 

Conclusion 

The results indicate clearly that the final system performance, 
that is, the final retrieval result after user feedback is applied, is 
highly dependent on the system performance of the initial indexing process. 



[Figure: recall-precision graph (precision vs. recall) with curves for the 
word-form dictionary and the weighted word-form dictionary; initial runs 
and second iteration feedback runs.] 

Performance comparison between a weighted and 
an unweighted word-form dictionary, before and 
after two feedback iterations (TIME collection) 

Fig. 1 



[Figure: recall-precision graph (precision vs. recall) with curves for the 
word-form dictionary and the thesaurus; initial runs and second iteration 
feedback runs.] 

Performance comparison between a thesaurus 
and a word-form dictionary before and after 
two feedback iterations (TIME collection) 

Fig. 2 



[Figure: recall-precision graph (precision vs. recall) with curves for the 
word-form dictionary and the weighted thesaurus; initial runs and second 
iteration feedback runs.] 

Performance comparison between a weighted 
thesaurus and a word-form dictionary before 
and after two feedback iterations (TIME collection) 

Fig. 3 



[Figure: recall-precision graph (precision vs. recall) with curves for the 
word-stem dictionary and the thesaurus; initial runs and third iteration 
feedback runs.] 

Performance comparison between a thesaurus 
and a word-stem dictionary before and after 
three feedback iterations (ADI collection) 

Fig. 4 



It must be noted that this conclusion is derived for the application 
of Rocchio's feedback algorithm; other feedback mechanisms, such as for 
example the replacement of the original query by the index of a retrieved 
relevant document, might yield different results. Unfortunately no comparative 
evaluations of different feedback strategies are available, but Rocchio's is 
at least the most established one. 

It is for this feedback strategy that one may state that every tool 
which improves the indexing performance as an outcome of the content analysis 
of natural language is beneficial because initial differences in system 
performance are retained after user feedback is applied. 


On Controlling the Length of the 
Feedback Query Vectors 

Karamvir Sardana 

Abstract 

Various strategies for reducing the lengths of feedback 
vectors, which get elongated during regular feedback processing, 
are tested. The results show that the strategies based on 
knowledge of the discrimination values of the concepts are quite 
successful. The method adjudged best retains the top best 
discriminating concepts in every feedback query vector. 

1. Introduction 

A) Indexing 

All automatic document retrieval systems use some method 
or other for converting a natural language document or query into 
a form that is representative of the corresponding document or 
query and can be stored internally in a computer. This process is 
known as indexing and is quite important because much of the 
efficiency of document retrieval systems depends on it. In the SMART 
automatic retrieval system [1], indexing transforms a piece of 
natural language text into its representative concept-weight (c-w) 
vector form meant for internal storage in the machine. A concept or 
a term is an atomic entity, a word or a phrase used to describe the document, 
whereas the associated weight denotes the concept's importance in the 
document. In the existing SMART system, which is taken as a test 
environment in the present study, the weight of a concept is normally a 
linear function of its frequency of occurrence in the text of a document 
or a query. In this report, the c-w vectors used in the SMART system are 
referred to as standard vectors or simply as vectors. 

B) Length of a Vector and Importance of Controlling it 

The length of a document or a query vector (referred to simply as 
a vector in the sequel) is defined to be the number of index terms or 
concepts (that is, the ones with nonzero weights) constituting the vector. 
The length of a vector affects the overall retrieval process in the 
following ways: 

i) The storage space occupied by a vector depends on the number 
of c-w pairs, that is, the length of the vector, and the 
storage needed for one c-w pair. In the present SMART system, 
each c-w pair occupies one computer word of storage. 

ii) The process of correlating two vectors is frequently used for 
document searching in retrieval systems. The correlation 
measure widely used in the SMART system can be interpreted 
graphically as the cosine of the angle between the two vectors 
in n-dimensional Euclidean space. The cosine correlation 
coefficient between the vectors 

$P = (p^1, p^2, \ldots, p^n)$, where $p^i$ is the weight of the i-th concept, 

and 

$Q = (q^1, q^2, \ldots, q^n)$ 

is computed as 

\[ \mathrm{COR}(P,Q) = \frac{\sum_{i=1}^{n} p^i q^i}{\left[\sum_{i=1}^{n} (p^i)^2\right]^{1/2} \left[\sum_{i=1}^{n} (q^i)^2\right]^{1/2}} \] 

When very long query vectors are matched with the 
document vectors, the cosine coefficient tends to be 
small because a factor proportional to the vector 
magnitude* appears in the denominator [7]. If the 
query vector is reasonably short, the vector magnitude 
is smaller, which implies a larger value for the 
correlation coefficients. If the user has specified a 
threshold value of the correlation coefficient to 
distinguish retrieved from nonretrieved documents, the 
use of a shorter query vector will produce a larger 
number of output documents than the corresponding longer 
query vector. 

iii) The time required for the correlation process, using an 
algorithm that stores only the nonzero weight concepts 
of the vectors (as is done in the SMART system), depends 
on the length of each vector. This is so because one 
must compare the two lists of concept numbers in order 
to determine the matching concepts. 
Thus, it is important that during various phases of 
processing through the retrieval system, such vectors not only be 
fully representative of their intended meaning but also be short and 
concise to give high correlation coefficients and to save on storage and 
searching costs. 

*As used here, the vector magnitude of a vector P is 
$\left[\sum_{i=1}^{n} (p^i)^2\right]^{1/2}$ 
and the length of a vector is the number of concepts with nonzero 
weights in the vector. These definitions of vector magnitude and 
the length of a vector are consistent with Murray's [2] definitions. 
Note that Salton's [7] definition of vector length is different 
from the corresponding definition used here and is the same as that 
of the vector magnitude in the present context. 

This implies that there is a definite need for controlling the 
length of vectors which have a tendency to grow long as a result of 
retrieval processing. This must be done by trimming the elongated 
and unwieldy vectors such that their shorter versions carry the gist of 
the corresponding originally long vectors. 

2. Earlier Results 

A) Murray's Strategy for Reducing the Vector Lengths 

Earlier experiments in this area have been conducted by Murray [2] 
for controlling the lengths of profile vectors. A profile vector (cluster 
centroid) is a vector that represents all the vectors in a cluster. Murray 
suggests reducing the lengths of profile vectors by chopping off 80% or so 
of the concepts with the lowest weights (and thus the lowest frequencies in 
the standard vectors), which according to his recommendations results in only 
a slight decrease in retrieval performance. This method can similarly be 
used for reducing the length of any vector. 

The idea is to remove those concepts in a vector which because 
of their relatively small weights, will not affect the orientation of the 
vector in the vector space by much and, therefore, will not appreciably 
influence the correlations of this vector with any other. Further, a 
detailed analysis by Murray shows the following justification for using 
this strategy : 

Let P be a vector whose length is to be reduced and Q 
be another vector with which the vector P is to be (cosine) 
correlated during the course of document searching. Then, the 
contribution to the total correlation by matching concepts with 
weights $p^i$ and $q^i$ is 

\[ \mathrm{CONTRIBUTION} = \frac{p^i}{|P|} \cdot \frac{q^i}{|Q|} = \frac{p^i q^i}{\left[\sum_{j=1}^{n} (p^j)^2\right]^{1/2} \left[\sum_{j=1}^{n} (q^j)^2\right]^{1/2}} \] 

For a fixed vector Q, the values of $q^i/|Q|$ are fixed and 
variations in contribution are due to $p^i/|P|$, called the correlation 
contribution ratio. 

Now, since in the present strategy for reducing the vector 
lengths the lowest weight terms of vector P are thrown out, the 
correlation loss due to these terms is small and the retrieval 
performance is not appreciably affected. 

This strategy is valuable because a vector can be trimmed 
to only 20% of its original length while sacrificing only a little 
in retrieval performance. 
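
A minimal sketch of this weight-based trimming and of the correlation contribution ratio is given below. It is an illustration only: the names `murray_trim` and `contribution_ratios`, the dictionary representation of a vector, the rule that at least one concept is always kept, and the example weights are all assumptions, not details taken from Murray's work.

```python
import math


def murray_trim(vector, percent):
    """Keep the top `percent`% of concepts in decreasing weight order.

    `vector` maps concept number to weight. Keeping at least one concept
    is an assumption of this sketch.
    """
    keep = max(1, int(len(vector) * percent / 100))
    ranked = sorted(vector.items(), key=lambda cw: cw[1], reverse=True)
    return dict(ranked[:keep])


def contribution_ratios(vector):
    """Correlation contribution ratio (weight / vector magnitude) per concept."""
    magnitude = math.sqrt(sum(w * w for w in vector.values()))
    return {c: w / magnitude for c, w in vector.items()}


# Hypothetical profile vector: two heavy concepts and eight light ones.
profile = {1: 12.0, 2: 9.0}
profile.update({c: 1.0 for c in range(3, 11)})

trimmed = murray_trim(profile, 20)       # retain 20% of the 10 concepts
ratios = contribution_ratios(profile)
print(trimmed)
print({c: round(r, 3) for c, r in ratios.items() if c not in trimmed})
```

With such a skewed weight distribution the discarded concepts all have small contribution ratios, which is the situation in which the strategy loses little; a vector with nearly uniform weights would not behave this way.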

B) Other Related Results by Murray 

Some other related and interesting results by Murray regarding profile 
vectors are presented here, to be used later on in the 
discussion. These results are: 

i) Weighted profile vectors are significantly better in 
performance than unweighted profile vectors. 

ii) Profile vectors consisting of concepts whose weights are 
rank values* give superior performance when base values are 
small. A rank value is the difference between a base value 
and the rank assigned to the term if all terms in the vector 
are ordered by decreasing frequency. The base value is a 
constant chosen large enough to ensure that all weights are 
positive in the profile vectors. 

iii) Profiles with concept weights based on frequency ranks** give 
performance better than the standard or rank value vectors. 
Such profiles avoid correlation domination by larger weight 
terms while at the same time allowing smaller weight terms to 
have relatively more say in determining the correlations. 

iv) Selection of good index terms is more valuable than making fine 
frequency distinctions among important index terms. 

v) Using a few broad categories of weight classes, typically four, 
gives performance equivalent to that obtained by using a larger 
number of weight classes as used in the standard weighted vectors. 

*The vectors formed in this way are called rank value vectors. 

**Frequency ranked vectors are constructed by arranging the concepts 
constituting a vector in increasing frequency order and then assigning 
the weight of a concept equal to its rank in such a list. The concepts 
with the same frequencies are assigned the same rank. It should be 
noticed that the frequency ranked vectors are essentially the rank value 
vectors for which the base value is chosen in such a way that the 
minimum weight of the concepts equals unity. 

3. Present Problem 

A) Origination 

Rocchio-type formulae are widely used for relevance feedback, and they 
have been shown to result in an improved retrieval performance [1]. One side 
effect of using relevance feedback in this manner is the growth in the length 
of the feedback query vectors (fqv's). The elongation of the fqv's 
can amount to as much as twelve times the average length of the 
corresponding original query vectors. To some extent, the growth 
in the lengths of the query vectors is quite desirable because, 
as Murray [2] has also observed, the original queries tend to be 
short and omit the background material that might really be 
helpful. On the other hand, too long a set of query vectors with 
possibly a lot of unimportant terms could damage the retrieval 
performance in addition to using up more storage and costing more 
in retrieval searches. The idea, then, is to reduce the elongated 
fqv's to their optimum length somehow. 

Furthermore, it would be interesting to see if relevance 
feedback can, in some way, be reinforced by using the knowledge of 
discrimination values (dv's) of concepts in the fqv's. The theory 
of the discriminating power of individual concepts has been expounded 
by Bonwit and Aste-Tonsmann [3] and developed further by Crawford [4]. 

B) Exact Definition and Scope of the Problem 

The problem at hand is to discover strategies to trim the 
fqv's to vectors of manageable shorter lengths such that the 
retrieval performance obtained by using the trimmed versions of 
the fqv's is better than or at least equivalent (if possible) to 
that obtained by using the original elongated fqv's. The second part 
of the problem is to discover means to augment the relevance feedback 
by using the dv's of concepts, utilizing ideas of Bonwit and 
Aste-Tonsmann [3] and Crawford [4]. The results are to be compared with 
Murray's work [2]. 

It is worth recalling that some work on feedback reinforcement by 
term dv's has already been done by Bjorklof [5] and Doeppner, Finley and 
Peterson [6], using strategies which have not met with much success. 

C) Methods and Solutions in Brief 

It is proposed to use the information on dv's of concepts in 
achieving both ends. Briefly, the procedure to be used is as follows: 

i) Take a document collection and the associated queries. After 
the original iteration of document searching, do one iteration 
of regular Rocchio-type feedback. Note its performance and 
save the resulting fqv's. 

ii) Order the concepts in each elongated fqv in decreasing dv order. 

iii) Retain top n concepts of the ordered vector, where n depends 
on the particular vector and the strategy used, as explained 
later. 

iv) Reorder the shortened vector back to the original ordering of 
the concepts. 

v) Do the original iteration of document retrieval search using 
the trimmed fqv's and compare its performance to that of the 
regular feedback from step (i) above. 
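
Steps (ii) to (iv) can be sketched in a few lines. The sketch assumes that an fqv is held as a dictionary from concept number to weight and that the dv of every concept is available in a second dictionary; the function name `trim_by_dv` and the example values are hypothetical.

```python
def trim_by_dv(fqv, dv, n):
    """Retain the n best discriminating concepts of a feedback query vector.

    fqv: {concept_number: weight}
    dv:  {concept_number: discrimination value}
    """
    ordered = sorted(fqv, key=lambda c: dv[c], reverse=True)  # step (ii)
    kept = ordered[:n]                                        # step (iii)
    return {c: fqv[c] for c in sorted(kept)}                  # step (iv)


# Hypothetical fqv: concept 7 has the largest weight but the worst dv.
fqv = {3: 1.0, 7: 5.0, 12: 2.0, 20: 4.0}
dv = {3: 0.9, 7: -0.1, 12: 0.5, 20: 0.2}
print(trim_by_dv(fqv, dv, 2))
```

Note that the weights themselves are left unchanged; only the choice of which concepts survive depends on the dv's.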

The motivation for this approach is as follows. In Murray's work, 
the lowest weight concepts are shown to have small correlation ratios and 
discarding such terms is not considered to be harmful. On the other hand, the 
theory of dv's of terms, and specifically their non-monotonic relationship with 
the frequency of occurrence of the corresponding terms, as shown by 
Crawford [4], indicates that some presently low weight but good 
discriminating terms could be of potential importance in the 
present and future correlating process. Thus, throwing away such 
potentially valuable terms might hurt the retrieval performance, 
while retaining them might really help. 

4. Vector Trimming Strategies 

First, some notations are given and operations are defined: 

i) Notations: 

a) X --O--> Y : Operator O operates on the initial vector X to yield 
the final vector Y. 

b) V : Elongated fqv which is to be reduced in length. 

c) A-order : Original alphabetical order (of the concept numbering). 
Thus a vector in A-order means that its concepts are numbered in 
original alphabetical order. 

d) D-order : Decreasing dv order (of the concept numbering). Thus a 
vector in D-order means that its concepts are numbered in decreasing 
dv order. 

ii) Possible Operators, O: 

a) A/D : Concepts of the initial vector are renumbered from A-order 
to D-order to yield the final vector. 

b) D/A : Concepts of the initial vector are renumbered from D-order 
to A-order to yield the final vector. 

c) Murray n : From the initial vector, the top n% of the total number 
of concepts in decreasing weight order are retained to form the 
final vector. 

d) Fixed n : The final vector is composed of the top n concepts of 
the initial vector. If the length of the initial vector is less 
than n, then the final and the initial vectors are identical. 

e) DV Rank n : The final vector is composed of all concepts of the 
initial vector with dv rank (to be defined shortly) less than or 
equal to n. 

Utilizing these notations and operators, Fig. 1 is a tree illustration 
of the algorithms of all four strategies for reducing the vector lengths. 
Assuming the original query vectors Q of Fig. 2(a) for illustration purposes, 
a brief description of each of the strategies follows: 

A) Strategy I 

As described earlier, this is Murray's [2] strategy and for every 
vector V, it retains the top n% of the total number of concepts in 
decreasing weight order to produce a shortened vector V^1. The suggested 
value of n is around 20. Fig. 3 shows the vectors V^1 obtained from the 
vectors V of Fig. 2(b) by a 70% reduction in length. 

For the following three strategies, the first common step is to renumber 
the concepts of each elongated vector V from A-ordering to D-ordering. 
Call the modified vector V^0. Fig. 2(c) shows the vectors V^0 obtained 
from the vectors V. 

B) Strategy II 

This strategy places a fixed upper bound on the length of each of 
the reduced vectors. This standardizes the lengths of the vectors 
and could make programming a little easier and more efficient. 

A fixed number n is chosen; the top n best discriminating concepts 
(if they exist, otherwise all) of V^0 (which is in D-order) are retained, 
and the concepts are renumbered to A-ordering to obtain the reduced 
length vector V^2. Figs. 4(a) and 4(b) depict the last two steps in 
obtaining vector V^2 from V^0 for n = 30. 

The suggested value of n is the average length of the document vectors 
in the document collection. 

C) Strategy III 

Here, the idea is to retain all those concepts which are the best 
discriminators. Specifically, the dv's of all concepts present in the document 
collection are calculated using Crawford's methods [4] and the concepts 
are ranked in D-order. The rank of a concept in such an ordering is called 
the dv rank of the concept. Thus, for example, any concept number in the 
vectors V^0 is equal to its dv rank. 

A dv-dv rank curve is plotted for the collection. The curve is an 
exponential looking curve for most of its range (Fig. 5). This curve has 
a sharp drop in the beginning and then approaches the X-axis asymptotically 
before it goes negative very steeply. 

A value m for the dv rank cutoff is determined from the curve; all 
concepts in V^0 with dv rank <= m are retained and then a final 
renumbering of the concepts from D-order to A-order yields the final 
reduced length vector V^3. These last two steps for obtaining V^3 
from V^0 are exhibited by the vectors of Figs. 6(a) and 6(b). 

A recommended value of m, the dv rank cutoff, is the one near 
the foot of the first steep drop of the dv-dv rank curve. More exactly, 
that value of m is chosen where the slope of the curve is <= e, 
where e is a constant for the particular collection. A value of 
e = 0.00004 giving m = 500 is found appropriate for the document 
collections used in this project. A method for determining e for 
a particular document collection is given later in Section 6(B). 
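
One way to read the cutoff rule is sketched below. The report does not say how the slope is measured, so the sketch makes an assumption: the slope at a given dv rank is approximated by the drop in dv between that rank and the next, and the first rank at which this drop is no larger than e is returned. The function name `dv_rank_cutoff` and the sample values are hypothetical.

```python
def dv_rank_cutoff(dvs_desc, e):
    """Return the first dv rank m at which the curve has flattened out.

    dvs_desc: discrimination values in decreasing order, so that
              dvs_desc[0] belongs to dv rank 1.
    e:        slope threshold, constant for a given collection.
    """
    for rank in range(1, len(dvs_desc)):
        drop = dvs_desc[rank - 1] - dvs_desc[rank]   # finite-difference slope
        if drop <= e:
            return rank
    return len(dvs_desc)                             # curve never flattens


# Hypothetical dv-dv rank curve: steep at first, then nearly flat,
# then sharply negative at the end.
curve = [1.0, 0.5, 0.2, 0.1, 0.09, 0.085, -0.5]
print(dv_rank_cutoff(curve, 0.02))
```

On a real collection the curve is likely to be noisy, so some smoothing or a check over several consecutive ranks may be needed before such a rule gives a stable cutoff.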

D) Strategy IV 

This strategy is intermediate between the last two 
strategies; using it, the length of any reduced length vector is the 
minimum of the lengths obtained by using Strategies II and III. It is the 
same as the previous strategy except that the length of each vector 
is further trimmed to n just before the concepts are renumbered back 
to A-ordering. Thus, in addition to using a dv rank cutoff m like 
Strategy III, it also places an upper bound n on the length of each 
vector like Strategy II, to yield the final reduced length vector 
V^4 (Figs. 7(a) and 7(b)). 

Note that this strategy is also equivalent to using Strategy II 
with a fixed upper bound on lengths equal to n and, in 
addition, trimming the vectors further down such that all concepts 
with dv rank greater than m are eliminated. 

As compared to the last strategy, the advantage of using this 
strategy is seen in getting standardized fixed length vectors, although 
some loss in performance is expected due to the loss of good discriminators 
(up to dv rank cutoff m) in some of the vectors. 
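
Strategies II, III and IV can be expressed with the operators defined at the start of this section. The following sketch is illustrative only and does not reproduce the RECODE and RETAIN programs: vectors are dictionaries, `dv_rank` is an assumed mapping from A-order concept number to dv rank, and every function name is hypothetical.

```python
def to_d_order(v, dv_rank):
    """A/D operator: renumber concepts by their dv rank."""
    return {dv_rank[c]: w for c, w in v.items()}


def to_a_order(v0, dv_rank):
    """D/A operator: restore the original (A-order) concept numbers."""
    concept_of = {rank: c for c, rank in dv_rank.items()}
    return dict(sorted((concept_of[r], w) for r, w in v0.items()))


def fixed_n(v0, n):
    """Fixed n operator: keep the first n concepts of a D-order vector."""
    return {r: v0[r] for r in sorted(v0)[:n]}


def dv_rank_m(v0, m):
    """DV Rank m operator: keep concepts whose dv rank is at most m."""
    return {r: w for r, w in v0.items() if r <= m}


def trim(v, dv_rank, n=None, m=None):
    """Strategy II (n only), Strategy III (m only) or Strategy IV (both)."""
    v0 = to_d_order(v, dv_rank)
    if m is not None:
        v0 = dv_rank_m(v0, m)
    if n is not None:
        v0 = fixed_n(v0, n)
    return to_a_order(v0, dv_rank)


# Hypothetical five-concept fqv and dv ranks.
dv_rank = {10: 3, 20: 1, 30: 600, 40: 2, 50: 7}
v = {10: 1.0, 20: 2.0, 30: 3.0, 40: 4.0, 50: 5.0}
v2 = trim(v, dv_rank, n=2)          # Strategy II
v3 = trim(v, dv_rank, m=5)          # Strategy III
v4 = trim(v, dv_rank, n=2, m=5)     # Strategy IV
print(v2, v3, v4)
```

Because a D-order vector is numbered by dv rank, both operators reduce to simple selections on the concept numbers, and the length produced by Strategy IV is the smaller of the lengths produced by Strategies II and III.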

5. Experimental Environment 

A) Retrieval System 

The SMART automatic document retrieval system is used as the test 
bed for conducting the experiments. This retrieval system is a major facility 
for conducting experiments to test and evaluate various document retrieval 
strategies. 

B) Data Collections 

Two different SMART document collections used in the present series 
of experiments are MEDLARS and CRANFIELD. The MEDLARS collection used is a 
450 document subset of the originally larger collection dealing with varied 
medical literature; the associated number of queries is 30. The indexing 
procedure used for representation of the documents and the queries in the 
machine utilizes a word stem dictionary. 

On the other hand, the CRANFIELD collection consists of a more 
homogeneous set of 424 documents dealing with aerodynamics; the associated 
queries form a 125 query subset of an originally 155 query collection. The 
indexing process used for this collection makes use of a word form dictionary. 
The procedure of indexing making use of dictionaries has been discussed in 
detail by Salton [7]. 

In many respects, these two document collections are 
different enough to warrant putting a great deal of confidence into 
the results based on the experiments done on them. 

C) Clustering Parameters 

The experiments are conducted on clustered document collections 
because Murray [2] has shown that document retrieval based on clustered 
files is more efficient than retrieval based on inverted files or 
individual documents, for instance. 

A clustered CRANFIELD document collection is available as a 
SMART collection, while the MEDLARS collection was clustered for this 
experiment. The clustering parameters used are: 

i) SMART routine CLUSTER is used for clustering by Rocchio's 
algorithm [8]. 

ii) The loose documents are placed with the centroid with 
which they correlate the highest. 

iii) The algorithm is allowed to choose an optimum number of 
documents to be batched for checking as possible cluster 
roots. 

iv) For Rocchio's density test, for a document to be a cluster point, 
at least 4 documents must have a correlation greater than 
0.3 and at least 8 documents must have a correlation 
greater than 0.1 with it. 

v) The minimum and maximum numbers of documents in each cluster 
are 8 and 25 respectively, excluding the loose documents 
blended in later on, as determined by step (ii) above. 

D) Searching Parameters 

The SMART routine SEARCH is used for the document searching experiments. 
The parameters used throughout this study are as follows: 

i) Feedback Parameters 

a) The number of documents retrieved for each iteration is 30. 

b) Among the top 10 retrieved documents, all the relevant ones are 
added to and the top nonrelevant one is subtracted from the original 
query to form the first iteration query. Only one iteration of 
feedback is carried out for the present experiments. 

ii) Cluster Searching Parameters 

a) At least 40 documents are correlated with the query on each 
iteration. 

b) At least 3 and at most 10 cluster nodes are expanded for each 
iteration. 

c) Any nodes whose cosine correlation with the query is within 
0.01 of the latest node selected for expansion are also expanded. 
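
The feedback rule of item (i)(b) can be sketched as follows. The coefficients of the Rocchio-type formula are not stated here, so this illustration assumes unit coefficients for the query and for every added or subtracted document, and it assumes that concepts whose resulting weight is not positive are dropped. The function name `feedback_query` and the example data are hypothetical.

```python
def feedback_query(query, ranked_docs, relevant_ids, top=10):
    """Form a first-iteration feedback query vector.

    query:        {concept_number: weight}
    ranked_docs:  list of (doc_id, vector) in decreasing correlation order
    relevant_ids: set of doc_ids judged relevant by the user
    """
    new_query = dict(query)
    subtracted = False
    for doc_id, vector in ranked_docs[:top]:
        if doc_id in relevant_ids:
            for concept, weight in vector.items():     # add every relevant one
                new_query[concept] = new_query.get(concept, 0.0) + weight
        elif not subtracted:
            for concept, weight in vector.items():     # subtract only the top
                new_query[concept] = new_query.get(concept, 0.0) - weight
            subtracted = True                          # nonrelevant document
    # Assumption: concepts with non-positive weight are removed.
    return {c: w for c, w in new_query.items() if w > 0}


# Hypothetical query with one concept and three retrieved documents.
query = {1: 1.0}
ranked = [
    ("d1", {1: 1.0, 2: 2.0}),   # relevant
    ("d2", {2: 1.0, 3: 1.0}),   # top nonrelevant: subtracted
    ("d3", {4: 1.0}),           # nonrelevant but not the top one: ignored
]
print(feedback_query(query, ranked, {"d1"}))
```

Even in this small case the feedback query is longer than the original query, which is the elongation that the trimming strategies are meant to control.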

E) Evaluation Techniques 

The basic evaluation technique used for the comparison of various 
retrieval search runs is the Precision-Recall (P-R) curve [7]. Neither 
the fluid nor the frozen feedback searches provide the true 
retrieval performance, and the more exact test and control group feedback 
method [9] is time consuming; the fluid searches are therefore chosen, with 
the intention of making relative comparisons only. 

The "ranking" and "feedback" effects occurring in relevance feedback, 
as discussed by Hall and Weiderman [10], are analyzed manually. In particular, 
the "feedback effect", which measures the improvement in retrieval performance 
due to the new relevant retrieved documents, is considered. 

6. Experimental Details 

A) Overall Flowchart of the Experiments 

Fig. 8 is an overall flowchart of the document retrieval 
search experiments. Basically, as also discussed in Section 3(C), 
an experiment to test out a strategy consists of the following 
four steps: 

i) Perform document retrieval SEARCH (SMART routine) on 
a collection, using one Rocchio-type feedback iteration. 
Obtain the P-R curves for the original (ORIG) and 
feedback (FDBK) iterations. 

ii) Shorten the elongated fqv's using one of the Vector 
Trimming Strategies to get the reduced length vectors. 

iii) Use these modified fqv's to SEARCH the document collection 
without any further feedback. Obtain the P-R curve for 
this modified (MOD) iteration. 

iv) Compare the SEARCH results for the FDBK and the MOD 
iterations. 

B) Detailed Description of One of the Experiments 

To give an idea of how these experiments are performed down 
to their inner details, a detailed description of one of the 
experiments is presented. 

Table 1 details the steps of the experiment using the MEDLARS 
document collection and Strategy III for the length reduction of the 
vectors. Figs. 2 and 6 give computer output examples of the vectors 
at various stages through the experiment. 

STEP  INPUT                      ROUTINES USED and/or            OUTPUT
NO.                              OPERATIONS PERFORMED
----  -------------------------  ------------------------------  --------------------------
1     a) MEDLARS document        SMART routine SEARCH:           i) P-R curves O and F for
         collection, C           original (ORIG) and the first      ORIG and FDBK iterations
      b) MEDLARS query           feedback (FDBK) iterations are  ii) Punched feedback query
         collection, Q (Fig. 2A) performed                           vectors, V (Fig. 2B)

2     C                          Crawford's Discrimination       List L
                                 Program [4]: gives the list of
                                 concepts in D-order. Output
                                 format is made suitable for
                                 use in step #7 below.

3     L                          A curve showing dv vs. dv rank  DV rank cutoff, m (Fig. 5)
      e (a predetermined         of concepts is plotted. The
      parameter, constant for a  value of the dv rank cutoff, m,
      particular collection)     where the slope of the curve
                                 is <= e, is determined.

4     L                          FORTRAN program ALPHDV: gives   Mapping M
                                 the mapping of concepts from
                                 A-order to D-order. Output
                                 format is made suitable for
                                 the next step.

5     V                          SMART routine RECODE: changes   Renumbered fqv's, V^0,
      M                          concept numbers in the          obtained as V --A/D--> V^0
                                 definition of fqv's from        (Fig. 2C)
                                 A-order to D-order.

6     m                          FORTRAN program RETAIN: for     Reduced length fqv's, V^3',
                                 each vector of V^0, it retains  in D-order (Fig. 6A)
                                 only those concepts with their
                                 concept number (= dv rank)
                                 <= m.

7     L                          SMART routine RECODE: restores  Reduced length fqv's, V^3,
                                 A-order of concepts within      in A-order (Fig. 6B)
                                 each vector of V^3'.

8     V^3                        SMART routine SEARCH: original  P-R curve, F^3, showing
      C                          iteration with no feedback is   performance of reduced
                                 performed for document          length fqv's
                                 searching.

9     F                          Compare P-R curves and other    Get the results.
      F^3                        retrieval statistics.

Detailed Description of the Experiment to 
Implement Strategy III on the SMART System 

Table 1 
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For this part of the experiment, the value of parameter e wes 
first determined for the MEDLARS collection by trial and error, such that 
the dv rank cutoff lies at the foot of the first sharp drop of the dv-dv 
rank curve (Fig, 5). The value of e is found to be O^OOOOU and the same 
value of e is used for the CRANFIELD collection. This value is determined 
just once for the whole collection. 
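
The cutoff rule of step 3 can be illustrated with a minimal sketch. It 
assumes that the dv values are held in a list in D-order (best 
discriminator first, values decreasing) and that the "slope" is 
approximated by the drop in dv between consecutive ranks. The function 
name dv_rank_cutoff and the sample values are hypothetical and are not 
taken from the programs used in the experiment. 

```python
def dv_rank_cutoff(dv_in_d_order, e):
    """Return the first dv rank (1-based) after which the curve flattens.

    Assumption: the slope at a rank is approximated by the drop in dv
    from that rank to the next one; the cutoff is the first rank whose
    preceding drop is <= e.
    """
    for rank in range(1, len(dv_in_d_order)):
        drop = dv_in_d_order[rank - 1] - dv_in_d_order[rank]
        if drop <= e:
            return rank
    # The curve never flattens: keep every concept.
    return len(dv_in_d_order)


# Hypothetical dv values: a sharp drop followed by a flat tail.
dv = [0.010, 0.006, 0.003, 0.00299, 0.00298]
m = dv_rank_cutoff(dv, e=0.00004)
print(m)
```

On a real dv-dv rank curve the differences between neighbouring ranks 
are noisy, so some smoothing would probably be needed before such a 
rule is applied; the report itself fixes e by trial and error. 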

Another important point about this experiment is that during the 
length reduction of the vectors, care was taken to prevent the query from 
getting zeroed out completely. Thus, if after the application of the 
strategy the reduced query vector happened to contain no concept at all, 
the algorithm ensured that at least 5 concepts were retained (or, if the 
original query vector contained fewer than 5 concepts, all of them). 
A similar precaution was taken during the application of the other 
strategies as well. 
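
Steps 4 to 7 of Table 1, together with the precaution just described, 
can be sketched as follows. This is only an illustration under stated 
assumptions: a vector is represented as a {concept: weight} dictionary, 
the function names are hypothetical stand-ins for ALPHDV, RECODE and 
RETAIN, and the choice of which concepts to keep when a vector would 
otherwise be empty (here, the best-ranked ones) is an assumption, since 
the text does not say which concepts were retained. 

```python
def a_to_d_mapping(concepts_in_d_order):
    """Step 4: map each A-order concept number to its dv rank (1 = best)."""
    return {a: rank for rank, a in enumerate(concepts_in_d_order, start=1)}


def recode(vector, mapping):
    """Steps 5 and 7: renumber the concepts of a {concept: weight} vector."""
    return {mapping[c]: w for c, w in vector.items()}


def retain(vector_d, m, min_concepts=5):
    """Step 6: keep the concepts whose D-order number (dv rank) is <= m.

    Assumed fallback: if nothing survives, keep the min_concepts
    best-ranked concepts, or all of them if the vector is shorter.
    """
    kept = {c: w for c, w in vector_d.items() if c <= m}
    if kept:
        return kept
    return {c: vector_d[c] for c in sorted(vector_d)[:min_concepts]}


def strategy_iii(vector_a, concepts_in_d_order, m):
    """Trim one A-order fqv and return the result in A-order again."""
    a_to_d = a_to_d_mapping(concepts_in_d_order)
    d_to_a = {d: a for a, d in a_to_d.items()}
    return recode(retain(recode(vector_a, a_to_d), m), d_to_a)


d_order = [7, 3, 9, 1]            # hypothetical list L: concept 7 is best
fqv = {1: 2.0, 3: 1.0, 9: 4.0}    # hypothetical fqv in A-order
trimmed = strategy_iii(fqv, d_order, m=2)
rescued = strategy_iii({9: 4.0, 1: 2.0}, d_order, m=2)
print(trimmed, rescued)
```

In the second call no concept has a dv rank within the cutoff, so the 
fallback keeps the whole (two-concept) vector rather than producing an 
empty query. 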

7. Results 

A) Performance Curves Obtained 

First, the effect of variation of the individual parameters n and m 
(the parameters used in the description of the various strategies in 
Section 4) is studied for each strategy separately and performance curves 
are obtained. The best curve obtained for each strategy is then taken and 
comparisons are made between these. 

Figs. 9(a) to 9(d) show the performance curves obtained for each 
of the four individual strategies, using the MEDLARS document collection. 
Fig. 9(e) shows the comparison of the best P-R curves, one obtained for each 
of the strategies. Furthermore, in each of these figures, the average and 
the range of the number of concepts present in the fqv's and the total number 
of relevant documents retrieved among the top 10 and the top 30 ranks for 
the whole query collection are also given. For comparison purposes, the 
P-R curves obtained for the ORIG and FDBK iterations are included in each of 
the figures. 

Fig. 10 shows the comparison of the best P-R curves obtained for 
the various strategies for the CRANFIELD collection, except that Strategy 
IV was not tried, because results of using this strategy were expected to be similar 
to those obtained for the MEDLARS collection. 

B) Inference from the Performance Curves 

In the following analysis, all comparisons are made with respect 
to the performance curve obtained by the first regular feedback (FDBK) 
iteration. 

Tables 2, 3, 4 and 5 give such a comparative analysis for the four 
individual Strategies I, II, III, and IV, for the two collections used. 
The inferences drawn from these tables for each of the strategies are 
given below: 

i) Strategy I 

The use of this strategy for the reduction in the length 
of the fqv's results in an almost consistent loss of performance, 
by as much as 0.06 in precision at any recall level. 

ii) Strategy II 

Reduction in the lengths of the fqv's, using this 
strategy, results in almost equivalent or better performance 
than that obtained by using elongated fqv's. Keeping the 
maximum fixed length of the shortened fqv's equal to the average 
length of the document vectors seems reasonable and gives 
almost the best results for high recall values, which is most 
likely what the user desires. Using this formula, the length 
reduction is around 70% for the MEDLARS collection and 
around 55% for the CRANFIELD collection. 

The increase in precision at any recall level is up to 
0.12 for the MEDLARS collection while for the CRANFIELD 
collection, the performance results obtained with elongated 
and shortened fqv's are equivalent. 

SERIAL NO. 1 
   MEDLARS (Fig. 9(a)): Retaining top 30% concepts for each fqv, 
      that is n = 30%, gives high precision for lower recall values. 
   CRANFIELD: Retaining top 50% concepts for each fqv gives the best 
      precision for all recall values, among all the reduced length 
      vectors tried. 

SERIAL NO. 2 
   MEDLARS (Fig. 9(a)): For high recall values n = 50 gives the best 
      results. 
   CRANFIELD: Vectors with n = 20 give a P-R curve worse by 0.02 
      to 0.04 in precision for all recall levels. 

SERIAL NO. 3 
   MEDLARS (Fig. 9(a)): Overall results are poorer than regular 
      feedback by up to 0.06 in precision for same recall, for all 
      the trimmed vectors tried. 
   CRANFIELD: Overall results are poorer than regular feedback by 
      up to 0.04 in precision for any recall level, for all the 
      vectors tried. 

Performance Analysis of Strategy I in 
Comparison to the Regular Feedback Performance 

Table 2 

SERIAL NO. 1 
   MEDLARS (Fig. 9(b)): Initial value chosen for n is 40, being the 
      recommended average length of the document vectors. 
   CRANFIELD: Correspondingly, initial value of n = 62 is taken, 
      being the average length of the document vectors in the 
      collection. 

SERIAL NO. 2 
   MEDLARS (Fig. 9(b)): Values 30, 40 and 70 for n are tried. For 
      all n the P-R curves are better than the regular feedback 
      curve for higher recalls, which is probably what the user 
      desires. The increase in precision at same recall is as much 
      as 0.12. 
   CRANFIELD: n = 20 is the only other cutoff tried. 

SERIAL NO. 3 
   MEDLARS (Fig. 9(b)): At low recalls, P-R curves for all values 
      tried for n are worse than the regular feedback curve by as 
      much as 0.08 in precision. 
   CRANFIELD: For n = 62, the performance curve is almost equivalent 
      to the regular feedback curve, while the average number of 
      concepts in query vectors drops from 114 to 51. 

SERIAL NO. 4 
   MEDLARS (Fig. 9(b)): At intermediate recall values, n = 30 gives 
      better results while n = 40 gives results equivalent to 
      regular feedback. 
   CRANFIELD: The performance curve for n = 20 gives consistently 
      worse results. 

SERIAL NO. 5 
   MEDLARS (Fig. 9(b)): Overall results are better than those 
      obtained by regular feedback at desired high recalls, for all 
      values of n tried. 
   CRANFIELD: Overall, for n = 62, the results are equivalent to 
      those obtained by regular feedback. 

Performance Analysis of Strategy II in 
Comparison to the Regular Feedback Performance 

Table 3 

SERIAL NO. 1 
   MEDLARS (Fig. 9(c)): Initial value of m, the dv rank cutoff, is 
      taken as 500, having been determined from the dv-dv rank curve 
      (Fig. 5), such that the slope of the curve at the cutoff falls 
      below e = 0.00004. In addition, the values of 200 and 800 for 
      m are tried. 
   CRANFIELD: Calculating from the dv-dv rank curve as is done for 
      the MEDLARS collection, initial value of m = 500 is obtained. 
      The additional value of m = 300 is used. 

SERIAL NO. 2 
   MEDLARS (Fig. 9(c)): For high and intermediate recalls, m = 500 
      gives the best results and gives performance consistently 
      better than that obtained by regular feedback. The increase in 
      precision is as much as 0.08 at any recall level. 
   CRANFIELD: For both values of m tried, the retrieval performance 
      is not better than that for the FDBK iteration. 

SERIAL NO. 3 
   MEDLARS (Fig. 9(c)): At low recall values, m = [...] performance 
      being within 0.04 in precision compared to the regular 
      feedback curve. 
   CRANFIELD: m = 500 gives the best [...] regular feedback curve. 
      For this value of m, average number of concepts in the 
      modified fqv's is 32 compared to 114 in the regular fqv's. 

SERIAL NO. 4 
   MEDLARS (Fig. 9(c)): The results are much worse for m = 200, 
      supporting the theory that the deletion of good discriminators 
      spoils the retrieval performance. 
   CRANFIELD: At m = 500, the performance is worse by up to 0.04 
      in precision at any recall level. 

SERIAL NO. 5 
   MEDLARS (Fig. 9(c)): Overall, m = 500 gives the best results 
      being [...] the FDBK's performance while the average number of 
      concepts in the modified queries is just 19 compared to 109 of 
      FDBK queries. 
   CRANFIELD: Overall performance is a little worse than the regular 
      feedback performance. 

Performance Analysis of Strategy III in 
Comparison to the Regular Feedback Performance 

Table 4 

SERIAL NO. 1 
   MEDLARS (Fig. 9(d)): The value 30 for n is used, as this gives 
      the best performance for Strategy II. Similarly, the values 
      500 and 800 for m, which give the best performance for 
      Strategy III, are used. 
   CRANFIELD: This experiment is not tried because, observing the 
      results obtained for the MEDLARS collection, it is expected 
      that the performance of this strategy would be almost 
      equivalent to Strategy III with only a very minor loss in 
      performance. 

SERIAL NO. 2 
   MEDLARS (Fig. 9(d)): The results obtained are similar to those 
      obtained for Strategy III, being only a little worse. 
   CRANFIELD: (not tried) 

SERIAL NO. 3 
   MEDLARS (Fig. 9(d)): The best results are obtained for m = 800, 
      n = 30, when average number of concepts in the modified fqv's 
      is 21 as opposed to 109 in the regular fqv's. 
   CRANFIELD: (not tried) 

Performance Analysis of Strategy IV in 
Comparison to the Regular Feedback Performance 

Table 5 

iii) Strategy III 

The use of this strategy for trimming the fqv's has 
resulted in an improvement in performance of up to 0.08 in 
precision at high recalls, compared to the FDBK curve, for the 
MEDLARS collection. On the other hand, a similar experiment 
on the CRANFIELD collection shows a consistent loss of 
precision, by as much as 0.04. It should then be inferred, 
for the time being, that use of this strategy may give 
results anywhere in the range from a slight loss of performance 
to an appreciable improvement in performance, the benefit 
being that the length reduction is 70-80%. 

In the case of both collections, the initial values 
of the dv rank cutoff, m, are chosen from the respective dv-dv 
rank curves by comparing the slope of the curve to the value 
of e (here 0.00004). In both cases, m derived this way 
happens to be around 500, and this value of m gives the best 
results in both cases. This supports the calculation of 
m from the dv-dv rank curve in the suggested manner. 

iv) Strategy IV 

As expected, the application of this strategy for reduction 
in the length of the fqv's results in a minor loss of 
performance compared to that of the previous strategy (Fig. 9(e)). 
Suggested values of m and n are the ones calculated for 
Strategy III and Strategy II respectively. 

C) Comparison of the Four Strategies 

The comparison of the four strategies is done in three 
different ways, in the following manner: 

i) P-R Curves 

Fig. 9(e) shows the comparison of the best P-R 
curves, one obtained for each of the four strategies 
for trimming the fqv's, using the MEDLARS collection. 
Fig. 10 shows the corresponding comparison for the 
CRANFIELD collection. The deductions from these 
figures are given below, separately for the two 
collections. 

a) MEDLARS Collection 

1) Strategy III with m = 500 is the overall 
best. 

2) For this best strategy, at intermediate and 
high recalls, the precision is better than that for 
the regular feedback by as much as 0.08 at any 
recall level. 

3) At very low recall levels this strategy 
gives a slightly worse performance than that for the 
regular feedback. 

4) The average number of concepts in the 
shortened fqv's, using Strategy III, is only 19 
compared to 109 in standard feedback query vectors, 
a reduction of approximately 80% in the length. 

5) Compared to this, Strategy I (Murray's 
method) with 22 as the average number of concepts 
in the reduced length fqv's, that is, with 
approximately the same 80% reduction in length, gives 
consistently worse performance compared to the 
regular feedback. The precision is worse by as 
much as 0.06 at any recall level. 

6) For approximately 70% reduction in length, even 
Strategy II performs better than Murray's method. 

b) CRANFIELD Collection 

1) Among vectors with 50% of the concepts chopped off, 
Strategy II with n = 62 and average number of concepts 
equal to 51 is the best, along with the almost equivalently 
performing Strategy I with 50% concepts removed, that is 
with average number of concepts equal to 57 (Fig. 10). 

2) With 70% reduction in the length of the vectors, 
Strategy I (n = 30 and the average number of concepts = 32) 
and Strategy III (m = 500 and the average number of 
concepts = 34) are almost equivalent in performance, though 
Strategy I has a slight edge over the latter (Fig. 10). 

3) All the strategies tried give a small loss of 
up to 0.04 in precision, but the average number of concepts 
is reduced by 50-70%. 

ii) Overall Performance Indices 

Here, four overall performance indices are compared 
for the best P-R curves obtained. These indices are Normalized 
Recall, Normalized Precision, Rank Recall and Log Precision [7]. 
Table 6 gives such a comparison for the two collections. The figures 
obtained substantiate the conclusions of the previous sub-section (i). 
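
For orientation, the four indices can be computed for a single query as 
in the sketch below. The formulas are the forms in which these SMART 
measures are commonly stated, and it is an assumption that reference [7] 
uses exactly these forms; the function name and the example ranks are 
hypothetical. Here n is the number of relevant documents, N the size of 
the collection, and r_i the rank of the i-th relevant document. 

```python
import math


def performance_indices(relevant_ranks, collection_size):
    """Overall indices for one query from the ranks of its relevant documents.

    Assumed definitions (commonly stated forms):
      normalized recall    = 1 - (sum r_i - sum i) / (n (N - n))
      normalized precision = 1 - (sum ln r_i - sum ln i) / ln C(N, n)
      rank recall          = sum i / sum r_i
      log precision        = sum ln i / sum ln r_i
    """
    ranks = sorted(relevant_ranks)
    n, big_n = len(ranks), collection_size
    if not 0 < n < big_n:
        raise ValueError("need 0 < number of relevant documents < collection size")
    ideal = range(1, n + 1)
    sum_r, sum_i = sum(ranks), sum(ideal)
    log_r = sum(math.log(r) for r in ranks)
    log_i = sum(math.log(i) for i in ideal)
    # ln of the binomial coefficient C(N, n), via log-gamma.
    log_binom = (math.lgamma(big_n + 1) - math.lgamma(big_n - n + 1)
                 - math.lgamma(n + 1))
    return {
        "norm_recall": 1 - (sum_r - sum_i) / (n * (big_n - n)),
        "norm_precision": 1 - (log_r - log_i) / log_binom,
        "rank_recall": sum_i / sum_r,
        # log_r is zero only when the single relevant document has rank 1.
        "log_precision": log_i / log_r if log_r > 0 else 1.0,
    }


best = performance_indices([1, 2, 3], collection_size=10)
worst = performance_indices([8, 9, 10], collection_size=10)
print(best, worst)
```

All four indices depend only on the ranks of the relevant documents, 
which is why they are affected by the ranking effects discussed in the 
following sub-section. 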

iii) Individual Query Behavior 

a) Necessity of the Analysis 

The analysis done so far has clearly established that 
for the MEDLARS collection, the reduced length fqv's 
give superior performance compared to the regular feedback 
with its elongated fqv's. Thus, one gets the advantages 
both ways: smaller search costs because of the reduced 
length vectors, and better retrieval performance. Moreover, 
Murray's method (Strategy I) is worse, with a poorer 
retrieval performance. 

(a) MEDLARS Collection 

                                      REDUCED LENGTH 1ST FEEDBACK QUERY VECTORS 
MEASURE OF       ORIG      1st FDBK   STRATEGY  STRATEGY  STRATEGY  STRATEGY 
PERFORMANCE      QUERY     QUERY      I         II        III       IV 
                 VECTORS   VECTORS    n=30%     n=30      m=500     m=500, n=30 

NORM RECALL      0.??63    0.7646     0.7134    0.7502    0.77?4    0.7??4 
NORM PRECISION   0.5898    0.7301     0.6312    0.7234    0.74?7    0.7333 
RANK RECALL      0.1399    0.2513     0.2278    0.2354    0.2335    0.2377 
LOG PRECISION    0.4837    0.6165     0.5837    0.6205    0.6268    0.6272 

(b) CRANFIELD Collection 

                                      REDUCED LENGTH 1ST FEEDBACK QUERY VECTORS 
MEASURE OF       ORIG      1st FDBK   STRATEGY  STRATEGY  STRATEGY  STRATEGY  STRATEGY 
PERFORMANCE      QUERY     QUERY      I         I         II        III       IV 
                 VECTORS   VECTORS    n=30%     n=50%     n=62      m=500 

NORM RECALL      0.7496    0.8178     0.7997    0.7953    0.8216    0.7933    - 
NORM PRECISION   0.6125    0.7359     0.7178    0.7180    0.7384    0.7172    - 
RANK RECALL      0.2048    0.3312     0.3199    0.3249    0.3308    0.3261    - 
LOG PRECISION    0.4330    0.5712     0.5596    0.5537    0.5689    0.?5?8    - 

(A "?" marks a digit that is illegible in the source copy.) 

Comparison of the Performance Indices for the 
MEDLARS and the CRANFIELD Collections 

Table 6 

On the other hand, the comparatively worse 
performance reflected by the P-R curves for the 
CRANFIELD collection for Strategy III is somewhat 
unexpected. However, another surprise occurs when 
one looks at the table in Fig. 10 and finds that, for 
the whole query collection, the total numbers of 
relevant retrieved documents among the top 10 ranks 
and among the top 30 ranks are 
405/574 for Strategy III and 392/560 for Strategy I, 
both for approximately 70% reduction in the length of 
vectors, compared to 406/567 for the regular feedback. 
Therefore, if one considers the total number of 
relevant retrieved documents as a criterion for 
better performance, Strategy III really is the 
better of the two. The use of the total number 
of relevant retrieved documents as a criterion of 
retrieval performance is justified when one considers 
that from the user's point of view, the number of 
relevant documents among those retrieved is the more 
important thing; the rank of a relevant document among 
the retrieved documents is a secondary consideration. 
Yet, the ranking of the relevant documents among the 
retrieved affects the P-R curves to quite an extent, 
making the true evaluation a little difficult. 

This situation is further complicated by the 
undesirable "ranking effect" during feedback [8]. 
Most of the relevant documents used for positive 
feedback improve their ranks after the feedback 
iteration. This helps in improving the P-R curve, 
which shows a better performance, even though from 
the user's point of view, no improvement has taken 
place. This is so because he has already seen these 
relevant documents and it does not give him any new 
information if these very documents improve in their 
rankings. We will give the name "positive ranking 
effect" to such a "ranking effect". 

In addition, a "negative ranking effect" could 
also occur. Suppose that after feedback, the relevant 
retrieved documents that the user has already "seen" 
decrease in ranks. This could happen, for instance, 
when relevant documents are located in two distinct 
regions of the document space and when one group is 
retrieved, the other is not. This results in a 
decrease in the retrieval performance shown by the P-R 
curve, even though from the user's point of view no 
such decrease has taken place. Again, he is not concerned 
with whatever happens to the ranks of the documents he 
has already seen. 

All this calls for a more detailed analysis of 
the individual query behavior, so that a truer 
picture of the actual improvement in retrieval 
performance can be formed. 

CRANFIELD Collection Query Behavior 

1) Effect of Various Concepts on Retrievability 

Table 7 shows the effects of various concepts 
on retrievability of the individual queries for the 
CRANFIELD collection. From the whole query 
collection, three typical queries, called 
Type A, B and C respectively, are chosen as 
examples. Using queries of Type A, the performance 
of Strategy III is better than that of Strategy I, 
while the reverse is true for queries of Type B. 
Furthermore, queries of Types A and B show how 
the presence of good discriminating concepts and/or 
the absence of poor discriminators helps in the 
improvement of retrieval performance. This is 
a pleasing result because it supports the hypothesis 
used as a basis for this project. However, there are 
comparatively very few queries of Type C, which 
show that sometimes concepts with numerically high 
dv ranks (relatively poor discriminators) are 
more important in improving retrieval performance 
than the ones with numerically low dv ranks 
(relatively good discriminators). One can conclude 
that even though the theory of dv's is not completely 
foolproof, in the majority of cases it decides 
the state of affairs. 

"Ranking Effect" and Retrievability 

To determine the influence of the "ranking 
effects", both positive and negative, the 
documents retrieved by queries for the best case 
(that is, for the best parameter) of each of the 
strategies tried are examined query by query for 
the whole of the query collection. The procedure 
used has been to note the ranks of the relevant 
documents retrieved in the ORIG iteration among 
the top 10 ranks (denote the set of these documents 
by S). As these are the documents used for 
positive feedback (in the present series of 
experiments) and thus have already been "seen" by 
the user, the ranks of these same documents of 
the set S are observed in the retrieved documents 
obtained for the other iterations. If the ranks 
of these documents have improved (become numerically 
lower) compared to their corresponding ranks in the 
retrieved documents of the ORIG iteration, during any 
regular or modified feedback iteration, the 
results for this iteration suffer from the "positive 
ranking effect", that is, the performance has been 
overestimated compared to what it should be. 

On the other hand, if the rank of any document 
of the set S worsens (becomes numerically higher) 
compared to its rank in the retrieved documents of the 
ORIG iteration, during any regular or modified feedback 
iteration, the results for this iteration are influenced 
by the "negative ranking effect". In other words, the 
performance has been underestimated compared to what it 
should be. 
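
The bookkeeping described above can be sketched as follows. This is an 
illustration only: rankings are assumed to be lists of document 
identifiers in rank order, a document missing from the second ranking is 
assumed to have dropped below its last retrieved rank, and the function 
name and the sample data are hypothetical. 

```python
def ranking_effects(orig_ranking, other_ranking, relevant, top=10):
    """Classify the already "seen" relevant documents (the set S).

    S is the set of relevant documents among the top `top` ranks of the
    ORIG iteration. A numerically lower rank in the other iteration is
    counted as a positive ranking effect, a numerically higher rank as
    a negative one.
    """
    orig_rank = {doc: r for r, doc in enumerate(orig_ranking, start=1)}
    other_rank = {doc: r for r, doc in enumerate(other_ranking, start=1)}
    not_retrieved = len(other_ranking) + 1   # assumed rank if absent
    seen = [doc for doc in orig_ranking[:top] if doc in relevant]
    effects = {"positive": [], "negative": [], "unchanged": []}
    for doc in seen:
        new_rank = other_rank.get(doc, not_retrieved)
        if new_rank < orig_rank[doc]:
            effects["positive"].append(doc)
        elif new_rank > orig_rank[doc]:
            effects["negative"].append(doc)
        else:
            effects["unchanged"].append(doc)
    return effects


orig = ["d1", "d2", "d3", "d4"]             # hypothetical ORIG ranking
modified = ["d2", "d1", "d3", "d5", "d4"]   # hypothetical MOD ranking
effects = ranking_effects(orig, modified, relevant={"d2", "d4"})
print(effects)
```

Counting the queries in which the "negative" list dominates for one 
strategy but not for another gives the kind of query tally reported 
for the two collections. 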

Table 8 depicts the documents retrieved by two 
typical types of queries, called Type D and Type E, 
affected by such biases. For both the queries, the 
performance obtained for the FDBK iteration does not 
suffer from any appreciable ranking effect, that is, 
the documents of the set S basically retain the same 
ranks among the top 10 in the FDBK iteration as in the 
ORIG iteration's retrieved documents. For the query 
of Type D, comparatively the Strategy III retrieval 
results get underestimated, while for the query of Type 
E, comparatively the Strategy I results get 
underestimated. Since the Type D queries are approximately 
20 in number compared to approximately 4 of Type E 
queries in the whole collection, it is concluded that 
the Strategy III results remain somewhat underestimated 
compared to the Strategy I results. 

[Table 8: for one query of Type D and one query of Type E, the 
documents retrieved at ranks 1 to 30 (and beyond 30) in the ORIG 
iteration and with Strategy I and Strategy III, relevant documents 
being marked R.] 

Remarks, Type D query: Negative "ranking effect" is more 
pronounced for this query using Strategy III than using Strategy I. 
Thus comparatively the Strategy III results get underestimated. 

Remarks, Type E query: Negative "ranking effect" is more 
pronounced for this query using Strategy I than using Strategy III. 
Actually only positive "ranking effect" occurs using Strategy III. 
Thus comparatively the Strategy I results get underestimated. 

Demonstration of Positive and Negative "Ranking Effects" 
Occurring during Document Retrieval by Feedback Queries 
for the CRANFIELD Collection 

Table 8 

MEDLARS Collection Query Behavior 

The effect of the various concepts on the retrieval 
performance for the MEDLARS collection is found to be 
similar to the findings for the CRANFIELD collection. 

It is, however, surprising to find that for the 
MEDLARS collection, the queries of Type E are absent, while 
there are at least 5 instances of the queries of Type D. 
Thus, the Strategy III results remain underestimated compared 
to the Strategy I results for the MEDLARS collection also. 
In spite of these odds, the Strategy III results have been 
better. 

D) Overall Comparison of the Four Strategies 

Comparison of the vector trimming strategies by the P-R curves shows 
Strategy III to be superior for the MEDLARS collection: it gives a 
performance better than the regular feedback while the average number of 
concepts in the modified fqv's is just 20% of the original length of the 
regular fqv's. Strategy I (Murray's method), on the other hand, gives 
consistently worse performance. This is true in spite of the fact that the 
Strategy III results get underestimated in comparison to the Strategy I 
results, as shown above. 

For the CRANFIELD collection, the performance shown by the P-R 
curves for Strategy III is worse than the regular feedback performance though 
almost equivalent to the performance for Strategy I. But the individual 
query behavior and the actual total relevant retrieved documents (among the 
top 10 ranks / the top 30 ranks) for the whole query collection using 
Strategy III (405/574), compared to those for Strategy I (392/560) and the 
regular feedback (406/567), show the Strategy III performance to be almost 
equivalent to the regular feedback performance. 

This shows that Strategy III may be used for reducing the lengths 
of the elongated fqv's with almost equivalent or better retrieval performance, 
while Strategy I always results in some degradation of performance compared 
to that of the regular feedback. Whenever vectors of fixed length are 
desired, Strategy II or Strategy IV may be used, the latter being a little 
better. 

One remarkable point about all the strategies is that a 
reduction in the lengths of the fqv's by even 80% keeps the 
performance much closer to that of the regular feedback (FDBK) 
iteration than to that of the original (ORIG) iteration. 

8. Discussion. 

In view of Murray's work on the construction of superior 
profiles [2] and in view of the present study, a few interesting 
questions arise concerning the initial indexing process and the 
application of the length reducing strategies used in conjunction 
with the various indexing methods. The purpose of this Section is to 
discuss such problems and propose some solutions. 

Murray suggests using a shortened frequency ranked profile 
and finds that using a few broad weight categories, typically four, 
gives performance equivalent to that obtained by using a larger 
number of weight classes as in the standard weighted vectors. 
Even though Murray's work concerns the profile vectors, it would 
be expected that his results could be carried over to the indexing 
process for the vectors in general. Some experiments in document 
retrieval, using Murray's ideas, might prove the validity of this 
assumption, which is made in the following discussion. 

A) Shortened Frequency Ranked Vectors 

Murray's construction of the shortened frequency ranked vectors 
is performed by first forming the frequency ranked vectors and then 
deleting the lowest weight concepts. It is easy to see that Strategy 
III, in addition to Murray's strategy, can also be used for reducing 
the lengths of the frequency ranked vectors. This is so because 
the only difference between these and the standard vectors is in the 
formation of the c-w pairs to represent the vectors; there is no 
difference in the structure of the vectors. 

B) A Few Categories of Weight Classes 

An interesting question is: which length reducing strategy should 
be used when considering vectors that use only two or four weight classes? 
Here, using Murray's strategy of chopping off low weight concepts could wipe 
out some lower weight classes completely. If only one weight class is left, 
the resulting vector is equivalent to an unweighted vector (multiplied by a 
constant), which, according to Murray, could result in decreased retrieval 
performance. In such a case, a length reducer like Strategy III could be 
useful. This would, hopefully, retain all the original weight classes, 
the idea being that it gives the lower weight but good discriminating 
terms a chance of "survival" to show their worth in subsequent retrieval 
operations, e.g. during feedback. 
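
The point can be illustrated with a small sketch. The vector, its two 
weight classes, the dv ranks and the function names are all 
hypothetical; ties between equal weights are broken by concept name, 
which is an assumption made only to keep the example deterministic. 

```python
def chop_low_weights(vector, keep):
    """Strategy I style: keep the `keep` highest-weight concepts."""
    ranked = sorted(vector.items(), key=lambda cw: (-cw[1], cw[0]))
    return dict(ranked[:keep])


def keep_good_discriminators(vector, dv_rank, m):
    """Strategy III style: keep the concepts whose dv rank is <= m."""
    return {c: w for c, w in vector.items() if dv_rank[c] <= m}


# A vector using only two weight classes (2 and 1).
vector = {"a": 2, "b": 2, "c": 1, "d": 1}
dv_rank = {"a": 10, "b": 900, "c": 20, "d": 950}

by_weight = chop_low_weights(vector, keep=2)
by_dv = keep_good_discriminators(vector, dv_rank, m=500)

print(set(by_weight.values()))   # weight classes left after Strategy I
print(set(by_dv.values()))       # weight classes left after Strategy III
```

Both reductions halve the vector, but only the dv-based one keeps a 
concept from the lower weight class, so the reduced vector is still a 
weighted one. 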

C) Use of Negative Dictionaries 

The method judged best for reducing the lengths of the fqv's is to 
retain the best discriminating concepts in each vector above an appropriately 
chosen cutoff m (Strategy III). Deleting the poor discriminators from each 
vector suggests the use of negative dictionaries. One idea that occurs is 
to initially index the documents and the queries utilizing Crawford's [4] 
negative dictionaries which use the same dv rank cutoff m as the one 
determined for use in Strategy III. In that case, since no vector can have 
concepts with dv ranks numerically greater than the cutoff m at any time, 
there is no need for using 
length reducing Strategy III (or any other) at any stage of the 
retrieval processing. This would save computation time in addition 
to the usual savings in storage and searching costs. The aim here is 
to examine the usefulness of this idea. 

The above possibility is depicted by the flowchart in Fig. 11(a). 
On the other hand, Fig. 11(b) shows both the use of a negative dictionary 
with a dv rank cutoff m' at the indexing stage and the use of a 
length reducer with a smaller dv rank cutoff m to shorten the 
length of the fqv's. 

In one of his experiments using the MEDLARS collection and a 
negative dictionary construction algorithm, Crawford has studied the 
effect on the retrieval performance of deleting poor discriminators 
from the document and the query collections. He concludes that for 
the MEDLARS collection, a dv rank cutoff of m' = 1000 for the negative 
dictionaries is the best in the sense that there is very little change 
in any of the performance measures for the dv rank cutoff between 1000 
and 5940, while the performance decreases sharply by deleting the 
concepts below the dv rank cutoff of 1000. 

In this project, a dv rank cutoff of m = 500 was found 
optimal for using Strategy III to reduce the lengths of the fqv's. 
It seems that for the same collection, m would be <= m'. If 
m = m', then using this cutoff in the negative dictionaries avoids 
the use of Strategy III for subsequent length reduction of the fqv's 
(Fig. 11(a)). If m' > m, a comparison must be made between the 
performance of the processes shown in Figs. 11(a) and 11(b). We examine the 
two possibilities specifically for the MEDLARS collection and take 
m = 500 and m' = 1000 based on previous experiments. 

1. Crawford's results show that the retrieval performance 
obtained by using the dv rank cutoff of 500 in the negative dictionary 
(Fig. 11(a)) is appreciably worse than the retrieval performance 
obtained by using the corresponding cutoff of 1000 (Fig. 11(b)). This 
implies that after the original document search of Fig. 11(b), the 
relevant documents retrieved would be more in number and/or have ranks 
higher (numerically lower) than the corresponding relevant documents 
retrieved using the steps in Fig. 11(a). Thus the fqv's formed by using 
positive feedback at stage R in Fig. 11(b) would, in general, be 
better formed and be composed of comparatively more concepts above 
dv rank of 500 (the best discriminating concepts) than the fqv's 
at stage S in Fig. 11(a). 

Moreover, trimming the fqv's of stage S by a length reducer 
employing a dv rank cutoff of m = 500 would retain all those concepts 
above dv rank 500 in the reduced length fqv's (at stage T in Fig. 11(b)) 
that are present in the elongated fqv's originally. Thus the number 
of the best discriminating concepts (the ones above dv rank 500) in 
the reduced length fqv's at stage T (Fig. 11(b)) would be more than 
in the regular fqv's at stage R (Fig. 11(a)). Because the concepts 
with dv ranks above 500 are the ones that most affect the retrieval 
performance (Fig. 9(c)), this means that the performance of the process 
depicted in Fig. 11(b) should be better than or at least equivalent to 
that of Fig. 11(a). 

2. Secondly, using a smaller dv rank cutoff of 500 at the 
indexing stage could completely zero out some queries at stage P in 
Fig. 11(a), while the probability of this happening at stage Q in Fig. 11(b) 
after using a dv rank cutoff of 1000 in the negative dictionary is 
comparatively smaller. Furthermore, the query that gets wiped out at 
stage P in Fig. 11(a) has a lesser chance of getting zeroed out after 
reducing the length of the corresponding fqv at stage T in Fig. 11(b). 
The reason is that the fqv formed at stage S would, in general, have 
more concepts above dv rank 500 compared to the original query vectors 
of stage L. 

The conclusion is in favor of using the processing steps of
Fig. 11(b) rather than those of Fig. 11(a). In other words, performance
would be better with a relaxed dv rank cutoff in the negative dictionaries
at the indexing stage, followed by a subsequent length reduction of the
vectors using a stricter dv rank cutoff, than with the stricter
cutoff in the negative dictionaries and no later length
reduction of the vectors. The exact tradeoffs between the improvement
in retrieval performance and savings in computation time should be
further investigated experimentally.
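
The argument for Fig. 11(b) can be illustrated with a minimal sketch. The toy concepts, weights and dv ranks, and the helper names trim_by_dv_rank and positive_feedback, are assumptions made only for illustration. "Above dv rank 500" is read here as rank <= 500 with rank 1 the best discriminator, and feedback is simplified to adding the weights of relevant documents to the query.

```python
def trim_by_dv_rank(vector, dv_rank, cutoff):
    """Keep only concepts whose dv rank is within the cutoff (rank 1 = best)."""
    return {c: w for c, w in vector.items() if dv_rank[c] <= cutoff}


def positive_feedback(query, relevant_docs):
    """Simplified positive feedback: add relevant document weights to the query."""
    fqv = dict(query)
    for doc in relevant_docs:
        for c, w in doc.items():
            fqv[c] = fqv.get(c, 0) + w
    return fqv


# Hypothetical dv ranks and vectors (concept -> weight).
dv_rank = {"a": 120, "b": 640, "c": 980, "d": 1500}
raw_query = {"b": 2, "c": 1, "d": 3}
raw_doc = {"a": 4, "b": 1, "d": 2}

# Fig. 11(a)-style: strict cutoff (500) applied in the negative dictionary.
strict_query = trim_by_dv_rank(raw_query, dv_rank, 500)

# Fig. 11(b)-style: relaxed cutoff (1000) at indexing, feedback, then trim at 500.
relaxed_query = trim_by_dv_rank(raw_query, dv_rank, 1000)
relaxed_doc = trim_by_dv_rank(raw_doc, dv_rank, 1000)
fqv = positive_feedback(relaxed_query, [relaxed_doc])
reduced_fqv = trim_by_dv_rank(fqv, dv_rank, 500)

print(strict_query)   # the strict cutoff removes every concept of this toy query
print(reduced_fqv)    # the trimmed fqv still holds a concept ranked within 500
```

In this toy case the query is zeroed out under the strict indexing cutoff, while the deferred trimming leaves a usable feedback vector, which is the situation described in point 2 above.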

D) Ideal Indexing 

An ideal indexing would be one in which the weight of a concept
is a true indicator of the worth of the concept with respect to the particular
document collection and with respect to the other concepts in the same vector.
An indexing method, it seems, should consider the frequencies and the
distributions of all the terms in the collection. In this regard, use of
the dv of a term by the indexing process is important, because the dv of a term
depends not only on the frequency and the distribution of that
term, but also on the frequencies and the distributions of all
the other terms in the collection [4].

If such an indexing scheme is used, then Strategy III for
reducing the vector lengths would not be appropriate. The reason
for this is that the term dv's have already been used in determining
the true overall weights of the concepts in the particular context,
and therefore it would seem more reasonable to chop off the low
overall weight (and thus truly unimportant) terms rather than to use
the dv's of the concepts again to help in reducing the vector lengths.
In such a case, Murray's method (Strategy I), which eliminates the
low weight concepts from elongated vectors, would be a good choice
for reducing the vector lengths.

9. Summary and Conclusions 

This study has made an attempt to use the dv's of the concepts 
to help in 

a) reducing the lengths of the elongated vectors (the fqv's, 
in particular), and 

b) the reinforcement of feedback. 

Questions that arise in view of this work and the previous
work in indexing, especially that by Murray [2], are examined.
The specific conclusions are:

i) Strategies for controlling the lengths of the fqv's
based on using the knowledge of the dv's of the concepts
have been found quite successful.

ii) In particular, a length reducing strategy based on retaining
only the best discriminating concepts (Strategy III of
this report) has been adjudged to be the most successful.
It reduces the lengths of the vectors by 70-80%,
thus saving on storage and searching costs while at the
same time attaining better or almost equal performance compared with
that obtained with the elongated fqv's of Rocchio-type feedback.

iii) In comparison, Murray's strategy results in lower performance
than the regular feedback for the experiments conducted.

iv) When fixed length query vectors are desired after trimming,
Strategy IV is useful, and it gives only a minor loss in
performance compared to the best adjudged Strategy III.

v) An interesting fact is that the use of any strategy for
80% length reduction of the fqv's results in retrieval performance
much closer to that of the regular feedback with elongated vectors
than to the performance of the original iteration without feedback.
This means that all the strategies tried are good in the respect
that they retain much of the benefit of the regular feedback
with at most a comparatively small loss in performance.

vi) It is observed that Strategy III could be useful when used along
with Murray's proposal of using only a few, typically four,
broad categories of weight classes.

vii) It is shown that retrieval performance is better with a
relaxed dv rank cutoff in the negative dictionaries during the initial
indexing stage, followed by a subsequent length reduction of the vectors
employing a stricter dv rank cutoff, than with the stricter
cutoff in the negative dictionaries and no
length reduction of the vectors later on.

One should note that for a large-scale application of
Strategy III as a length reducer, many of the experimental details
would be saved if all the concepts occurring in the document and
the query collections were renumbered in decreasing dv order.
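
The renumbering remark can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the experiments: the function names, the toy dv values and the cutoff m = 2 are hypothetical. Once concept numbers follow decreasing dv order, a dv rank cutoff reduces to a comparison on the concept number itself.

```python
def renumber_by_dv(dv):
    """Map each old concept number to a new one; new number 1 = highest dv."""
    ordered = sorted(dv, key=lambda c: dv[c], reverse=True)
    return {old: new for new, old in enumerate(ordered, start=1)}


def recode(vector, new_id):
    """Rewrite a concept -> weight vector using the new concept numbers."""
    return {new_id[c]: w for c, w in vector.items()}


def keep_best_discriminators(vector, m):
    """Length reduction by dv rank cutoff m on an already renumbered vector."""
    return {c: w for c, w in vector.items() if c <= m}


# Hypothetical dv values keyed by the old concept numbers.
dv = {10: 0.5, 11: -0.2, 12: 0.9, 13: 0.1}
new_id = renumber_by_dv(dv)

fqv = recode({10: 3, 11: 5, 13: 1}, new_id)
print(keep_best_discriminators(fqv, 2))
```

No separate dv table needs to be consulted at trimming time, since the concept number carries the rank.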

A final remark is that the results of this project tend to
support the view that the dv's and the frequencies of the concepts
are important in determining the indexing process and should play
a joint role in the design of an ideal indexing scheme.

The Shortening of Profiles on the Basis of
Discrimination Values of Terms and
Profile Space Density

Marc A. Kaplan

Abstract 

The problem of long profile vectors which naturally arise
in a clustered file environment is discussed. A method to reduce
profile length is suggested and tested. The method involves
eliminating those concept-weight pairs whose concepts have poor
discrimination values as defined previously by Bonwit and
Aste-Tonsmann.

1. Introduction 

Any information retrieval system which is to operate on a
large data base and which is to provide on-line service to users
must be designed to provide reasonable response time for the user.

In a SMART-like [1] environment, where documents and queries
are represented as vectors of numbers and a query-document correlation
function must be computed to evaluate the degree of similarity
between each query and document, one cannot hope to compute all of the
correlations between a given query and every document in the
collection and still give the user reasonable response time. Therefore,
one seeks strategies which will reduce the number of correlations
which must be computed before the retrieval system can produce an
"answer".

One such strategy is to use a clustered file organization for
the document space. In this organization whole groups of documents,
called clusters, are represented by one pseudo-document, called the
centroid or profile of the cluster. The centroid of a cluster is supposed
to be representative of all the documents in its cluster. Thus the
system need only correlate a query with all of the centroids in the
document space and then with all of the documents under those few centroids
which seem most promising, rather than with each and every document in the
entire collection.
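
A minimal sketch of such a two-stage search is given below. It is not the SMART SEARCH routine; the function names, the data layout (a list of centroid and document-group pairs) and the toy vectors are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine correlation between two sparse concept -> weight vectors."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def clustered_search(query, clusters, top_clusters=1):
    """Correlate the query with centroids, then only with documents
    under the most promising centroids."""
    ranked = sorted(clusters, key=lambda cl: cosine(query, cl[0]), reverse=True)
    scores = {}
    for _centroid, docs in ranked[:top_clusters]:
        for doc_id, vec in docs.items():
            scores[doc_id] = cosine(query, vec)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical two-cluster collection: (centroid, {doc_id: vector}).
clusters = [
    ({1: 1.0, 2: 1.0}, {"d1": {1: 2.0, 2: 1.0}, "d2": {2: 3.0}}),
    ({5: 1.0}, {"d3": {5: 1.0}}),
]
query = {1: 1.0, 2: 1.0}
print(clustered_search(query, clusters))
```

Here only two centroid correlations and two document correlations are computed; document d3 is never examined.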

One problem with clustered files is that the profile vectors are
often unreasonably long, that is, the profile vectors contain many
non-zero entries. Since in a usual implementation only the non-zero terms
are stored, long vectors require more storage than vectors with few
non-zero concept weights. (Typically vectors are stored as lists of
non-zero concept-weight pairs rather than as a list of ordered weights, one
for each possible concept number.)

These long profile vectors occur because the profile, P, of a
cluster is usually given by a formula such as:

(i)   $P = \sum_{i} D_i / n_i$

where i ranges over each document in the cluster,
      $D_i$ is the i-th document vector, and
      $n_i$ is some non-zero normalizing factor.

Thus the profile of a given cluster has as many different non-zero concepts
as there are distinct terms in the whole cluster.

Not only do all of these non-zero terms require considerable
storage but they must also certainly increase the time required
to compute a correlation value. This is so since all the concept-weight
pairs in the profile vector must be read into the computer
and then checked against all of the concepts in the query vector.
Even if none of the concepts match, the given correlation algorithm
must at least search through the long profile vector to determine
this.
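
The growth in profile length follows directly from formula (i), as the short sketch below suggests. The function name and toy documents are hypothetical, and the normalizing factor is assumed to be the number of documents in the cluster.

```python
def profile(docs):
    """Profile of a cluster per formula (i), assuming n_i = number of documents."""
    n = len(docs)
    p = {}
    for d in docs:
        for concept, weight in d.items():
            p[concept] = p.get(concept, 0.0) + weight / n
    return p


# Two short hypothetical documents (concept -> weight) sharing one concept.
docs = [{1: 2.0, 2: 4.0}, {2: 2.0, 3: 6.0}]
p = profile(docs)
print(p)
print(len(p))  # one non-zero entry per distinct concept in the cluster
```

Each document holds two concepts, but the profile already holds three; with many documents per cluster the profile approaches the full vocabulary of the cluster.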

So one must naturally ask if there is any way by which one
can "shorten" the profiles of a clustered collection of documents
without appreciably degrading the performance of the system with
regard to recall and precision levels. If this could be achieved
one would have a better retrieval system than the original system
with long profiles, since storage and CPU time would be conserved and
the user would be served faster.

Experiments made by Murray [2] suggest that one can indeed
shorten centroid vectors merely by deleting some of the
concept-weight pairs. He states:

"... profiles can be subjected to considerable deletion of
low weight (less frequent) terms with little change in the
quality of search output... Experiments... indicated that
the deletion of 80% of the lowest weight terms drops the
[illegible] only to 5%."

Murray's approach is local. He examines each profile individually
without considering the others and deletes those terms of lowest
frequency. Might not some high frequency terms also be good
objects for deletion? Murray says:

"On the other hand, an attempt to remove or combine related
occurrences of high weight profile terms results in much
poorer performance, such procedures are to be avoided."



In this paper the experiment is based on a different approach
towards finding the appropriate terms to delete from the profiles of a
collection. What is desired is a technique for finding terms which are
not important or may even be detrimental in computing query-centroid
correlations. The experimenter does not wish to prejudge that terms
should be deleted simply because they are of high, low or medium frequency.
Nor does he want to guess how many terms should be deleted or kept. What
then is he to do?

Bonwit and Aste-Tonsmann [3] have conducted research in the
construction of what they call "negative dictionaries", that is, lists
of words which are best not considered as concepts in a collection of
documents and which are best deleted from document vectors. One may ask
whether their technique might not be applicable to the present problem.

2. Density and Discrimination 

Bonwit and Aste-Tonsmann define document space density Q as:

(ii)  $Q = \frac{1}{N} \sum_{j=1}^{N} \cos(P, D_j)$

where N = number of documents
      $D_j$ = the j-th document vector
      P = centroid of documents given by (i)
          with $n_i = N$ for all i
      cos is the usual cosine correlation function
          as used in the SMART system

Now define vector $V^I$ as vector V with the set I of terms
deleted (that is, set to zero). Then the density of the document
space with the set I of terms deleted is defined as:

(iii) $Q_I = \frac{1}{N'} \sum_{k} \cos(P^I, D_k^I)$

where k ranges over the set of integers for which
          $D_k^I$ is not identically the zero vector
      N' = number of documents for which $D_k^I$ is
          non-zero (usually for small sets I, N' = N)
The discrimination value, $a_k$, of term k is now defined to be

(iv)  $a_k = 100 \cdot (Q_K - Q)/Q$ where K = {k}

The greater the value of $a_k$, the better the discrimination value of
term k is said to be. Intuitively, if $a_k$ is large and positive then
the deletion of term k from the document set causes the space to
"contract"; thus term k is thought to be a good term, necessary to
help distinguish one document from another. If $a_k$ is close to
zero then the deletion of term k should not affect the document
space significantly at all. Finally, if $a_k$ is negative then term
k is a bad discriminator; its deletion will almost certainly
improve the document space!
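
The density and discrimination value definitions can be sketched directly, as below. This is an illustrative reading of the formulas and not the FORTRAN programs referred to in this paper; the function names and the three toy documents are assumptions, and weights are taken to be non-negative.

```python
import math


def cosine(u, v):
    """Cosine correlation between two sparse concept -> weight vectors."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def centroid(docs):
    """Centroid with every normalizing factor equal to the number of documents."""
    p = {}
    for d in docs:
        for c, w in d.items():
            p[c] = p.get(c, 0.0) + w / len(docs)
    return p


def density(docs, deleted=frozenset()):
    """Space density with the terms in `deleted` set to zero everywhere."""
    reduced = [{c: w for c, w in d.items() if c not in deleted} for d in docs]
    p = centroid(reduced)
    nonzero = [d for d in reduced if d]  # only non-zero documents are averaged
    if not nonzero:
        return 0.0
    return sum(cosine(p, d) for d in nonzero) / len(nonzero)


def discrimination_value(docs, k):
    """100 * (Q_K - Q) / Q for the single-term deletion set K = {k}."""
    q = density(docs)
    return 100.0 * (density(docs, {k}) - q) / q


# Hypothetical space: "x" occurs in every document, the other terms in one each.
docs = [{"x": 1.0, "a": 1.0}, {"x": 1.0, "b": 1.0}, {"x": 1.0, "c": 1.0}]
print(discrimination_value(docs, "x"))  # negative: "x" does not discriminate
print(discrimination_value(docs, "a"))  # positive: "a" helps separate documents
```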

What one seeks is to find the optimum deletion set of terms,
I. The hypothesis is made that the set I for which $Q_I$ is
minimized should be the "best" set of terms to delete.

In their experiments Bonwit and Aste-Tonsmann attempt to
construct set I in the following way:

1. Compute $a_k$ for each term k.

2. Construct an ordered set $K = (k_1, k_2, \ldots, k_t)$ such that
   $a_{k_i} \ge a_{k_{i+1}}$, $i = 1, 2, \ldots, t-1$, where t is the number of
   unique terms in the document space.

3. Let $K_p = (k_{p+1}, k_{p+2}, \ldots, k_t)$. Note that $K_0 = K$ and $K_t$ = the
   empty set. $K_p$ is the set of all but the best p discriminators.

4. Now one assumes that in a good deletion set, I, one should
   want the worst discriminators, that is, one only wishes to keep
   the best, say p, discriminators as terms in the document
   space. Thus the problem of finding which of the $2^t$ subsets
   of K to use as the deletion set is reduced by the above
   assumption to finding which of the t subsets of the type
   $K_p$ to use as a deletion set.

5. Find p to minimize $Q_{K_p}$. Then $I = K_p$ is the desired deletion
   set.
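
The selection of the deletion set in the steps above can be sketched as a simple scan over p. The sketch assumes that the terms are already ordered from best to worst discriminator and that a density function for an arbitrary deletion set is supplied by the caller; the function name, the stub density values and the choice to start the scan at p = 1 (so that at least one term is kept) are assumptions made for illustration.

```python
def choose_deletion_set(ordered_terms, density_of):
    """Return (p, K_p) where K_p, all but the best p terms, minimizes density.

    ordered_terms: terms sorted from best to worst discriminator.
    density_of:    callable taking a frozenset of deleted terms, returning Q.
    """
    best_p, best_q = None, None
    for p in range(1, len(ordered_terms) + 1):
        q = density_of(frozenset(ordered_terms[p:]))
        if best_q is None or q < best_q:  # strict '<' keeps the smallest p on ties
            best_p, best_q = p, q
    return best_p, set(ordered_terms[best_p:])


# Stub density for illustration only: Q as a function of how many terms are deleted.
stub_q = {3: 0.9, 2: 0.6, 1: 0.6, 0: 0.8}
terms = ["a", "b", "c", "d"]
p, deletion_set = choose_deletion_set(terms, lambda deleted: stub_q[len(deleted)])
print(p, sorted(deletion_set))
```

Only t candidate sets are evaluated, one per value of p, rather than all subsets of K.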

Bonwit and Aste-Tonsmann found that for the particular collection
with which they worked $K_p$ was indeed a good deletion set. Retrieval
performance as measured by normalized recall was actually improved by
deleting the set of terms $K_p$. In fact, they found that for a sequence
of document spaces, each given by deleting more and more of the "bad"
discriminators, normalized recall was greatest almost at the point where
$Q_{K_p}$ was minimal.

One should notice that the deletion set $K_p$ as computed above
does not depend on any parameters external to the document space itself.
There is no need to choose any frequency cutoff value, nor is it necessary
to decide arbitrarily the percentage of the original terms which one
wishes to keep.

3. Experimental Design

The discrimination value approach is applied to the problem
of long vectors by considering the set of profile vectors, apart
from the rest of the collection, to be a document space. The
algorithm given in part 2 is then applied to the set of profiles
in order to compute the hopefully "optimum" deletion set $K_p$.

Existing FORTRAN programs [4] were modified by the experimenter
so as to compute automatically the values $Q_{K_l}$ for $l = 0, 1, \ldots, h$,
where h equals the number of positive discriminators. The smallest
value of l which minimized $Q_{K_l}$ was considered to be p, the optimum
number of positive discriminators to retain in the collection of
profiles.

Having computed p, the profile vectors were then modified
by the deletion of all but the best p discriminators.

(It should be pointed out that the procedure actually used
in computing the values $Q_{K_l}$ is not precisely the same algorithm as
given in part 2, although the results which are obtained in using
the FORTRAN procedure are believed to be well described by the
procedure given in section 2. The actual computations
are done by first computing Q in a straightforward manner, but
some of the results of necessary intermediate calculations are
retained. Then a simple and quick computation of the values of
$a_k$ is made by recomputing just a few intermediate values. Once
the values of $a_k$ are computed, they are sorted and a note is made
of how many positive values exist. Now to compute the values $Q_{K_l}$,
the program first computes $Q_{K_1}$ and saves some intermediate results.
The successive values $Q_{K_2}, Q_{K_3}, \ldots, Q_{K_h}$ are computed one after another
by merely accounting for the effect of adding but one more term to the
profile space. The interested reader is referred to the program listing
if he wishes to see the details of the computation.)
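
One way such an incremental computation could be organized is sketched below. It is not a transcription of the FORTRAN program, whose details are not given here; the function name and the toy data are assumptions. The idea is to keep, for each vector, the running dot product with the centroid and the running squared norm, so that adding one more term touches only these running sums.

```python
import math


def incremental_densities(docs, ordered_terms):
    """Density of the space after each term of ordered_terms is added in turn.

    docs are sparse concept -> weight vectors with non-negative weights.
    """
    n = len(docs)
    dot = [0.0] * n       # running P . D_j over the terms kept so far
    dnorm2 = [0.0] * n    # running |D_j|^2 over the terms kept so far
    pnorm2 = 0.0          # running |P|^2 over the terms kept so far
    densities = []
    for term in ordered_terms:
        # Unnormalized centroid weight; the cosine does not depend on the scale.
        p_k = sum(d.get(term, 0.0) for d in docs)
        pnorm2 += p_k * p_k
        for j, d in enumerate(docs):
            w = d.get(term, 0.0)
            dot[j] += p_k * w
            dnorm2[j] += w * w
        cosines = [dot[j] / math.sqrt(pnorm2 * dnorm2[j])
                   for j in range(n) if dnorm2[j] > 0.0]
        densities.append(sum(cosines) / len(cosines) if cosines else 0.0)
    return densities


# Hypothetical profile space and term order (best discriminator first).
docs = [{"x": 1.0, "a": 1.0}, {"x": 1.0, "b": 1.0}, {"x": 1.0, "c": 1.0}]
densities = incremental_densities(docs, ["a", "x"])
print(densities)
```

Each step costs time proportional to the number of vectors, instead of a full recomputation of every cosine over all retained terms.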

For each collection used in the experiment two sample searches
were made. The first search was run with the original profile vectors.
The second search was run with the modified profile vectors. Except for
the use of different profile collections, the SEARCH routine of the SMART
system, on which these tests were made, was given the same search parameters
for each of the two runs. SEARCH results were compared using routines
AVERAGE and VERIFY.

Two different collections were used in this study; both are
available on disc packs on the Cornell University 360 computing system.
The first, the document collection called ADIABTH DOCS, was used along with
its associated TREE1, the original profile collection, and QUESTS, a set
of queries for which relevancy decisions have been made externally to
the SMART system (82 documents, 35 queries and 13 profiles). The small
size of the collection allowed for economical debugging of programs and
the experimental procedure. Results obtained with this collection were
encouraging enough to warrant the use of a larger collection for further
study.

The second collection used was CRN4S DOCS (424 documents), QUESTS30
(30 queries), and KTREE (29 profiles). The KTREE centroid collection
was created especially for this study by the experimenter, using the
CLUSTER routine of the SMART system. (See computer run contained herein
for a description of the parameters used to create the profile collection.)
This collection was used because it was the largest available collection
which the experimenter could afford to use that was easily accessible on
the SMART system.

4. Experimental Results

The results of the experiment are promising. As shown in Table 1,
the storage requirements, in bytes, for the profile collections as maintained
in the SMART system have been cut by about 30%. Also the number of concepts
in the longest profile in a profile collection is cut by 33% in the case
of the CRN4S collection and 17% in the case of the ADIABTH collection.

Recall and precision performance is meanwhile hardly affected at all.
In the case of the CRN4S collection recall level averages are almost identical
for searches using the two different centroid collections, with perhaps a
slight advantage gained by using the original centroid collection instead of
the modified collection. However, the results of the VERIFY routine would
indicate that the difference is likely due to chance. (The overall chi-square
measure was greater than .9989 for each of the three statistical tests used:
t-test, sign test, Wilcoxon test.) Document level averages are also not
significantly different for the CRN4S collection. (Computer output gives a
chi-square of 1.0000 for each test.) For the ADIABTH collection, examination of
the recall level average curves seems to show a slight trend towards the modified
tree giving slightly lower precision in the low recall region of the curve and
higher precision in the high recall region of the curve, as compared to the
results obtained with TREE1. But the significance tests show that this
difference has perhaps a 50% probability of being due to chance.

Glancing over the term statistics for terms thrown out and terms
kept, one can find terms of relatively high, low or medium frequencies for
both collections which were either kept or deleted. This holds for both
document frequency (number of profiles in which the term in question occurs)
and overall frequency (summed weight of term throughout profile collection).
Thus it appears that Murray's suggestion that high frequency terms should
never be thrown out is not necessarily to be followed in the future. The
suggestion arose because Murray, in only looking at the profiles locally,
could not tell whether or not a high frequency term was a good discriminator
or a bad one, while it happens that for the most part low frequency terms
are usually poor discriminators. The procedure suggested herein, however,
by its very definition, considers terms as they affect the profile space
globally and hence can distinguish between "good" and "bad" high frequency
terms.

On Dynamic Document Space Modification
Using Term Discrimination Values

C.S. Yang

Abstract 

Brauen's algorithm for dynamic document space modification
has been shown to improve retrieval effectiveness. It adds new
terms to document vectors and increases the weight of some terms
already in the document vectors. In this study, term discrimination
values are utilized so that only good terms are either added to
documents or increased in weight. This keeps document vectors
relatively short and prevents poor terms from overriding
good terms. The storage and retrieval effectiveness of the new
version of the document space modification algorithm is studied.

1. Introduction to Dynamic Document Modification and Term 
Discrimination Values 

In the SMART system, each document or query is represented
by a vector. Document vectors may be constructed by elaborate
manual work or by automatic methods. Manual construction may produce
better results, but it requires too much human effort. The tendency
therefore is to construct document vectors automatically by utilizing
document abstracts and a dictionary, where the dictionary itself
may also be constructed automatically. As a result, the
well-formedness of automatically constructed vectors is questioned. Besides,
Brauen [1] states that "even though the vocabulary in many scientific
fields is essentially standardized, it does not remain constant over
time. Vocabulary in fact changes with new developments, new personnel,
and other factors. As a result, document vectors, which are reasonably
well defined at one time, may appear to be ill-defined five years later.
Given a group of users knowledgeable in some field and given that these
users submit a set of roughly similar queries, one might then expect
that similar sets of documents will satisfy all these queries. A
strategy designed to "group" document vectors about the user queries
to which they are relevant may then aid the retrieval performance of
similar queries submitted in the future."

These considerations motivate the desire to modify document vectors
according to users' opinions. Brauen has suggested the following algorithm for
Dynamic Document Space Modification (DDSM):

1) An initial query $q_0$ is submitted and processed. Relevance
   feedback iterations are performed until some modified query
   $q_n$ returns a list of documents satisfactory to the user.

2) Each relevant document vector identified by the user during
   the feedback process is then modified as follows:

   Let D be a document relevant to $q_0$.

   a) If concept $C_i$ belongs to $q_0$ but does not belong to
      D, then add $C_i$ to D with weight $D_i$ = BETA.          (1)

   b) If concept $C_i$ belongs to both $q_0$ and D, then modify
      $D_i$ by
          $D_i = D_i + \mathrm{GAMMA} \cdot (120 - D_i)$          (2)

   c) If concept $C_i$ belongs to D but does not belong to
      $q_0$, then modify $D_i$ by [equation illegible in source].
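
Rules (a) and (b) of the algorithm can be sketched as follows. The function name and the toy vectors are hypothetical, and the parameter values are used only as examples. Rule (c) is deliberately left out of the sketch because its formula is not legible in this copy of the text.

```python
def ddsm_update(doc, query, beta, gamma):
    """Apply rules (a) and (b) to one relevant document.

    doc and query map concept numbers to weights. Rule (c), which modifies
    concepts present in the document but absent from the query, is omitted.
    """
    updated = dict(doc)
    for concept in query:
        if concept not in doc:
            updated[concept] = beta                                          # rule (a)
        else:
            updated[concept] = doc[concept] + gamma * (120 - doc[concept])   # rule (b)
    return updated


# Hypothetical relevant document and query.
doc = {1: 40, 2: 120}
query = {1: 1, 3: 1}
print(ddsm_update(doc, query, beta=20, gamma=0.225))
```

Rule (b) moves a weight part of the way towards 120 and never beyond it, while every application of rule (a) lengthens the document vector by one concept.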

Brauen has shown that this algorithm does improve retrieval
performance because documents tend to center around those queries
to which they are relevant [1]. But two questions need to be considered:

1) From equation 1, new terms are added to the document vectors.
   After the document space is modified by many queries,
   is it possible that so many new terms are added to
   the document vectors that the vectors are unreasonably
   long? If so, document storage becomes a serious
   problem. On the other hand, the COSINE correlation
   between two vectors is a term matching process. It
   takes more time to calculate the correlation between
   longer vectors.

2) Are all terms in the original queries good ones? If
   this is not necessarily the case, is it wise to judge
   the usefulness of a term which is in the query but not
   in the document before the term is added to the document
   vector? Similarly, is it wise to judge the usefulness
   of a term which is in both the query and the document
   before it is increased in weight?

With respect to the first question, several observations
are made based on experiments using the CRN4S DOCS document space
(424 documents) and CRN4S QUESTS query space (155 queries). The
results of the earlier experiments show that many documents are
significantly increased in length. A typical example is the
following:

Query 4 has 12 concept terms and document 18 has 76. Query
4 retrieves document 18 as a relevant document with rank 6. Eight terms
in Query 4 but not in document 18 are added to document 18. Later on,
Query 5 also retrieves document 18 with rank 6 as a relevant document.
Seven terms are added to this document. Document 18 thus increases
from 76 terms to 91 terms after being modified by only two queries.

When a document vector has been lengthened to a certain extent,
is it possible that relatively few new terms will be added to it? To
obtain an estimate of this, consider how the longest vector (document
234) in CRN4S DOCS behaves. This document has 186 terms. Query 23
retrieves it with rank 5. It is found that 6 out of the 8 terms in
query 23 are not in document 234. If a long vector is lengthened so
quickly, an originally short vector (short vectors in CRN4S DOCS have
lengths of about thirty terms) will probably double or triple in length
after being modified by many queries.

The above arguments support the desirability of limiting document
lengths. A natural way to do this is to modify Brauen's algorithm so
that only good terms are added to documents (by equation 1) and poor
terms are deleted from document vectors (by equation 2).

The "goodness" of a term was first studied by Bonwit and
Aste-Tonsmann [2]. A term is considered to be a good term, i.e., a
discriminator, if its distribution is such that it serves to distinguish
or discriminate among documents in the collection. Otherwise it is a
non-discriminator. For example, if a term occurs in almost all documents
in a collection, it may not be a good discriminator since it may have
little effect in distinguishing among the documents.

A formulation of a term discrimination value function
developed by Crawford [3] is introduced as follows:

Let $D_1, D_2, \ldots, D_N$ be a collection of N documents. Each
document, $D_i$, is represented by a vector $D_i = (D_{i1}, D_{i2}, \ldots, D_{iM})$
where M is the number of terms in the dictionary for the collection
and $D_{ij}$ is the weight of concept j in document i.

The centroid vector, $C = (C_1, \ldots, C_M)$, of the document
collection is defined as

    $C_j = \sum_{i=1}^{N} D_{ij}, \quad j = 1, 2, \ldots, M$

The Compactness, Q, of the document collection is

    $Q = \frac{1}{N} \sum_{i=1}^{N} \cos(C, D_i), \quad 0 \le Q \le 1.$

The Compactness of the document collection with term i deleted
is

    $Q_i = \frac{1}{N} \sum_{j=1}^{N} \cos(C^i, D_j^i)$

where $C^i$ and $D_j^i$ are respectively the centroid vector and jth
document vector with term i deleted. The Discrimination Value,
$V_i$, of term i is defined as

    $V_i = \frac{Q_i - Q}{Q} \cdot 100$
Since a good term can distinguish among the documents, its
existence makes the document space more sparse, i.e., less compact.
Then $Q_i > Q$ for a good discriminator $C_i$.

Conversely, a poor term fails to distinguish among the documents;
its existence makes the space more compact. Then $Q_i < Q$ for a poor
discriminator $C_i$.

So, the higher the discrimination value for a term, the better
discriminator it is. If $V_i < 0$ for term $C_i$, $C_i$ is called a
non-discriminator.

The above theory therefore generates a coefficient proportional
to the usefulness of a term.



2. Dynamic Document Space Modification Using Term Discrimination Values

The following modified version of Brauen's algorithm is proposed
to alleviate the problems mentioned before.

1) An initial query $q_0$ is submitted and processed. Relevance
   feedback iterations are performed until some modified query
   $q_n$ returns a list of documents satisfactory to the user.

2) Each relevant document vector identified by the user during
   the feedback process is then modified as follows: let D be
   a document relevant to $q_0$.

   a) Concept $C_i$ belongs to $q_0$ but does not belong to D.
      If $C_i$ is a good discriminator then add $C_i$ to D
      with weight $D_i$ = BETA.

   b) Concept $C_i$ belongs to $q_0$ and D. If $C_i$ is a good
      term then modify $D_i$ by
          $D_i = D_i + \mathrm{GAMMA} \cdot (120 - D_i)$

   c) Concept $C_i$ belongs to D but does not belong to $q_0$.
      Modify $D_i$ by [equation illegible in source].

Poor discriminators in the queries are not added to
document vectors because of the extra condition in (a). Also, poor terms
in the documents do not gain weight because of the condition in (b).
From the point of view of performance, if a document space is full
of poor terms with high weights, they will override the effect of
good terms and hence the retrieval effectiveness will deteriorate.

3. Experiment 

The following data bases are used: 

Document Space: CRN4S DOCS (424 documents) 

Query Space: CRN4S QUESTS (155 queries) 

Tree Structure: REW-CRN4S TREE (63 first level centroids, 

14 second level centroids, 1 root node) 

Brauen calls two queries similar if they have three or more
terms in common. Otherwise two queries are nonsimilar. He divides
the 155 queries into two subsets. 125 of them are used to modify
the document space and are called the Modification Set. Among the
remaining thirty queries, fifteen are similar to some queries
in the Modification Set and fifteen are not similar to any query in
the Modification Set. The thirty queries are used to test the
effectiveness of the document collection modified by the 125 queries.
His convention is adopted in this experiment. Both sets of queries
are shown in Table 1.

A new load module called SMARTDSM has been set up to
accommodate the document space modification and the Retirement
policy. The latter is not discussed in this paper. One can choose
to run DDSM (with or without term discrimination considerations) and/or
document retirement by properly setting several parameters.

Crawford's program calculates the discrimination value of
each of the 4439 terms in the dictionary for the Cranfield
document collection. All terms are ordered and renumbered in
descending discrimination value so that concept 1 has the highest
discrimination value and the last concept has the lowest discrimination
value. The document, query, and tree structure vectors are recoded
according to this new dictionary. In this new environment, the
modified Brauen method can be restated as follows:

1) An initial query $q_0$ is submitted and processed.
   Relevance feedback iterations are performed until some
   modified query $q_n$ returns a list of documents
   satisfactory to the user.

2) Each relevant document vector identified by the user
   during the feedback process is modified as follows:

   Let D be a document relevant to $q_0$.

   a) Concept $C_i$ belongs to $q_0$ but does not belong
      to D. If $C_i \le$ GTERM then add $C_i$ to D with
      weight $D_i$ = BETA * (1 + 0.5 * ONE). If GTERM < $C_i \le$
      NTERM then add $C_i$ to D with weight $D_i$ = BETA.

   b) Concept $C_i$ belongs to $q_0$ and D. If $C_i \le$ NTERM
      then modify $D_i$ by $D_i = D_i + \mathrm{GAMMA} \cdot (120 - D_i)$.

   c) Concept $C_i$ belongs to D but does not belong to $q_0$.
      Modify $D_i$ by [equation illegible in source].

GAMMA = 0.225 and DELTA = 8 are the optimal values obtained by Brauen
and are used throughout this study.

In (a), the parameter ONE can be set to 1 or 0. If ONE = 1, 
more emphasis is put on the very high discrimination value terms added 
to documents. If ONE =0, all terms added are treated equally. If ONE = 1 
and BETA =20, terms with very high discrimination values are added with 
normal weight 30 and other terms added have weight 20. 
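
Steps (a) and (b) of the modification can be sketched in Python as 
follows. This is only an illustration under stated assumptions: the 
function name `modify_document`, the dictionary representation of 
vectors, and the default cutoffs (GTERM = 407, NTERM = 2465) are chosen 
for the example and are not taken from the SMARTDSM module. Step (c) is 
omitted because its formula is not legible in the source. 

```python
def modify_document(doc, query, gterm=407, nterm=2465,
                    beta=20.0, gamma=0.225, one=1):
    """Return a modified copy of a relevant document vector.

    doc, query: dict mapping concept number -> weight. Concepts are
    assumed to be numbered in descending discrimination value, so a
    small concept number means a good discriminator.
    """
    new_doc = dict(doc)
    for concept in query:
        if concept in doc:
            # Step (b): concept is in both query and document.
            if concept < nterm:
                new_doc[concept] = doc[concept] + gamma * (120 - doc[concept])
        elif concept < gterm:
            # Step (a), very good term: weight BETA * (1 + 0.5 * ONE).
            new_doc[concept] = beta * (1 + 0.5 * one)
        elif concept < nterm:
            # Step (a), ordinary term: weight BETA.
            new_doc[concept] = beta
        # Concepts numbered NTERM or above are cut off and never added.
    return new_doc
```

With ONE = 1 and BETA = 20, a query concept below GTERM that is absent 
from the document enters with weight 30, and one between GTERM and NTERM 
enters with weight 20, as stated in the text. 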

The distribution of discrimination values of the terms is 
shown in Fig. 1. The first 407 terms in the dictionary have discrimina- 
tion values greater than 0.01. They are considered to be very good terms. 
The last 80 terms have negative discrimination values. Their document 
frequencies range from 46 to 219. These are therefore very high 
frequency terms. 

Table 2 shows the seven sets of GTERM and NTERM cutoffs used in 
this experiment. The 125 queries modify the document space for each set 
of cutoffs. Two iterations are run. Relevant documents in each iteration 
with rank above 30 are modified. If every query modifies six documents 
on the average, there are 6 x 125 = 750 modifications, and each document is 
modified 750/424 = 1.77 times. The remaining thirty queries then test 
the retrieval performance of the resulting document spaces. The number 
of terms added to the document space is also counted. 

4. Discussion of Results 

Table 2 summarizes the experimental conditions. No document space 
modification is performed for Set 0. This is a standard search run used 
for comparison. 

Seven sets of parameters used in the experiment 
with or without DDSM and term discrimination 
considerations 

There is no cutoff in the DDSM for Set 1. This is the 
standard Brauen method. 2969 terms are added to the 424 documents. 
Each document is lengthened by 7 terms on the average. 

For Set 2, only the eighty negative discrimination value 
terms are deleted. Because of the high frequencies of these 80 terms, 
the number of terms added to the collection drops sharply, from 
2969 to 2403. 

Set 6 exhibits the largest cutoff. Compared with the 
standard DDSM (Set 1) where 2969 terms are added, the number of terms 
added is decreased by (2969 - [illegible])/2969 = [illegible]. 

Sets 3 and 4 have intermediate cutoff values and 
(2969 - [illegible])/2969 = [illegible] of the terms are prevented 
from being added to the document collection. 

The seven sets are tested with the similar and nonsimilar test 
sets respectively. Their retrieval performances are shown in Tables 
3 to 6 and plotted in Fig. 2 to 7. The performances can be 
summarized as follows: 

1) Similar queries in the 0th iteration. 

The original Brauen DDSM (Set 1) shows a considerable 
superiority over all others. Sets 2, 3, 4, 5, and 6 are 
almost the same at all recall levels. They have poorer 
precision than that of Set 0 at very low recall levels 
but are universally better at recall levels above 0.3. 
The significance tests show that the superiority of any 
one of the five sets (Sets 2, 3, 4, 5, 6) to the other 
four sets is not significant. One may conclude that 
these five sets have approximately the same performance, 
which is superior to that of Set 0 but inferior to that of 
Set 1. 

2) Nonsimilar queries in the 0th iteration. 

Slight cutoffs (Sets 2, 3, 4) are better than Set 1 at low 
recall levels (below 0.2) and equal to Set 1 at all other 
recall levels. Severe cutoffs seem to degrade the performance. 

3) Similar queries in the first iteration. 

Sets 1, 2, 3, 4 and 5 are practically equal in performance. 
The significance tests show that none is substantially better 
than the others. A very large cutoff like the one used for 
Set 6 produces a slight deterioration in performance. Compared 
with the standard search run (Set 0), all six sets are by 
far superior. 

4) Nonsimilar queries in the first iteration. 

Sets 2, 3, 4, 5 and 6 have essentially the same performance. 
They are all better than Set 1 at recall levels below 0.6. 

From the above observations, one can reach the following conclusions: 
Except for the similar test set in the 0th iteration, the cutoff of poor 
terms seems to maintain the performance of the standard DDSM. In the 
first iteration, the nonsimilar queries show a somewhat better performance. 
But too large a cutoff like Set 6 will tend to degrade performance. As 
a compromise, an intermediate cutoff like Set 3 is suggested. Considerable 
storage can be saved while effectiveness is still maintained. 

The inferior performance of Sets 2, 3, 4, 5 and 6 in the 0th 
iteration for similar queries is understandable. For most queries $q_s$ in 
the similar test set (except queries 6, 9, 63, 78) one can find one or more 
queries $q_m$ in the modification set with the following properties: 

(i) $q_s$ and $q_m$ have almost the same relevant document 
set (say, $D_1, D_2, \ldots, D_k$). 

(ii) Some negative discrimination value terms (say, $C_1, C_2, 
\ldots, C_n$) are in both $q_s$ and $q_m$ but not in the 
relevant documents ($D_1, \ldots, D_k$). These terms 
are very high frequency terms. 

Since no cutoff occurs for Brauen's original DDSM, when $q_m$ 
retrieves some of ($D_1, \ldots, D_k$) as relevant documents (say, $D_1, 
\ldots, D_j$) and modifies them, ($C_1, \ldots, C_n$) are added to these 
documents. Afterwards, when $q_s$ in the similar test set is submitted, 
documents $D_1, \ldots, D_j$ will have high correlation with $q_s$ because 
$C_1, C_2, \ldots, C_n$ appear in both $q_s$ and $D_1, D_2, \ldots, D_j$. 
If a cutoff is used, $C_1, C_2, \ldots, C_n$ are not added to $D_1, D_2, 
\ldots, D_j$ when they are modified by $q_m$. This explains why Brauen's 
standard DDSM works especially well for similar queries in the 0th 
iteration. 

As mentioned above, terms $C_1, C_2, \ldots, C_n$ are very high 
frequency negative discrimination value terms. Some of them even 
appear in half of the documents in the CRN4S DOCS collection, so 
they will probably appear in the queries very frequently. Actually 
Jones [5] showed that, for three independent collections, half of the 
query terms are high frequency terms (Table 7). If many queries 
are submitted (in this experiment each document is modified only 
1.77 times on the average), then each document will be modified 
many times, and these high frequency terms will enter into all 
document vectors in the long run. In case this happens, the effect 
of these terms will be negligible. Set 1 will then have roughly 
the same performance as Sets 2 through 5 for similar queries in the 
0th iteration. 

Collections                    Cranfield     INSPEC        Keen 

No. of documents               200           [illegible]   797 
No. of requests                [illegible]   97            [illegible] 
No. of terms                   712           1341          939 
No. of frequent terms          96            73            50 
Average No. of terms 
per request                    6.9           5.6           5.3 
Average No. of frequent 
terms per request              3.6           2             1.8 

Term distribution statistics for three 
independent collections. (The last two 
rows show the ratios of frequent query 
terms) 

Table 7 

5. Conclusion 

Sets 3 and 4 use NTERM = 2465. From Fig. 2 and Fig. 3, one 
can see that these sets actually achieve the same effectiveness as 
Set 1. So, at least 4439 - 2465 = 1974 low discrimination value 
terms do not contribute much to retrieval effectiveness. For this 
value of NTERM, (2969 - [illegible])/2969 = [illegible] of the added 
terms can be eliminated. A reasonable portion of the memory storage 
can therefore be saved. 

This study also shows that the term discrimination value 
concept is acceptable. Slight cutoffs of nondiscriminators do not 
deteriorate retrieval performance. But neither do they improve 
effectiveness. Additional work on the theory of term discrimination 
values therefore seems to be needed. Defining term goodness on sound 
theoretical grounds is probably the most urgent task. It is believed 
that, with a good judgment of the usefulness of terms, the proposed 
version of DDSM should lead to better results, both in performance 
and in storage requirements. 

The Use of Document Values for Dynamic 
Query Document Processing 
A. Wong and A. van der Meulen 

Abstract 

In the field of document retrieval it might be advantageous 
to take into account the utility of documents in the collection, 
that is, the average usefulness of a specific document for a 
given user population in terms of satisfaction of the users' 
information need. 

In this report the following two questions are investigated: 

a) how to improve system performance by assigning so-called 
utility values to each document, and 

b) how to base a document retirement policy on those 
quantities. 

A feedback environment provides the possibility for 
automatically creating a list of utility values. These values are 
then based on the retrieval history of a document. That is, for all 
the queries for which a particular document is retrieved, user 
judgments about its relevancy are utilized to compute a quantity which 
reflects the usefulness of a document for the collection and its users. 

Utility values may then be used during the retrieval process 
to promote the retrieval of satisfactory documents and suppress the 
retrieval of obsolete and mediocre documents. Another application 
of the availability of document quality values is a document retirement 
policy based on the utility value score in the collection. Documents 
with low utility values may be placed in an auxiliary file to keep 
the main file more current and up to date. 

1. Introduction 

The performance of a document retrieval system is generally 
evaluated by means of two well known system parameters, recall and 
precision, which are based on the average user judgments about relevancy 
or non-relevancy of the retrieved documents. 

Keeping track of user decisions might be quite beneficial for 
future users since provision can be made for detecting the quality of 
the judged documents. By automatically creating a new document property 
called the utility value, which represents the average judgment of the 
user population about a given document, it seems plausible that system 
performance can be improved by using this value in the retrieval process. 
Out-of-date documents or questionable publications, even if indexed in a 
proper way, will no longer be retrieved since their correspondingly low 
utility values prevent this. 

The goal of this investigation is twofold: 

a) system improvement by suppressing documents which are likely 
to be non-relevant and promoting relevant items in the course 
of the retrieval process; 

b) system retirement by transferring out-of-date and mediocre 
documents to an auxiliary file. 

2. The Methodology 

If the retrieval history of a set of queries in a system is known 
in the form of relevancy decisions of the user for the retrieved documents, 
one may use this information in deciding which documents are generally 
useful and which ones are not. In an on-line system the relevancy 
decisions are rendered as part of the feedback procedures and the 
bookkeeping can be done in a very convenient way. 

The procedure proposed is to assign to each document in the 
collection a so-called utility value which is set equal to 1 
initially. This value is increased if a document is retrieved and 
found to be relevant, and decreased if retrieved and judged to be 
nonrelevant. The increment-decrement function is so chosen that the 
utility values range between 0 and 2 and are not able to exceed these 
values. (For a comprehensive description of this function, see [4] 
and Appendix 1.) 
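
The bounded update can be illustrated with a short Python sketch of the 
sine function defined in Appendix 1. The function name `sine_update` 
and its argument layout are assumptions made for this example; only the 
formula itself comes from the report. 

```python
import math


def sine_update(v, relevant, c=8.0):
    """Return the updated utility value for one relevancy decision.

    v:        current utility value, between 0 and 2 (initially 1)
    relevant: True to increase the value, False to decrease it
    c:        step-size constant C of Appendix 1 (a larger C gives
              a smaller step)
    """
    # Transposed utility value x = arcsin(v - 1); the clamp only
    # guards against rounding error at the ends of the range.
    x = math.asin(max(-1.0, min(1.0, v - 1.0)))
    # The step shrinks to zero as x approaches +pi/2 or -pi/2,
    # which is what keeps the value inside [0, 2].
    dx = (math.pi / 2 - abs(x)) / c
    return 1.0 + math.sin(x + dx if relevant else x - dx)
```

Starting from the initial value 1 with C = 8, one relevant judgment 
gives 1 + sin(pi/16) and one nonrelevant judgment gives 1 - sin(pi/16). 
Repeated judgments in one direction approach 2 or 0 without crossing 
them. 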

The utility value is then a system parameter which can be 
applied in the retrieval algorithm immediately since it reflects the 
utility as judged by the users. The new retrieval function Rf will 
be: 

Rf = cos(q,d) * U.V. 

That is, Rf is the product of the utility value (U.V.) and the cosine 
correlation function cos(q,d). 
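
A minimal Python sketch of ranking by Rf is given here for 
illustration. The names `cosine` and `rank_documents`, and the 
dictionary layout of the vectors, are assumptions for the example and 
do not describe the SMART implementation. Documents without a recorded 
utility value are treated as having the initial value 1. 

```python
import math


def cosine(q, d):
    """Cosine correlation of two sparse vectors (dict concept -> weight)."""
    dot = sum(w * d.get(c, 0.0) for c, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0


def rank_documents(query, docs, utility):
    """Return document ids ordered by Rf = cos(q,d) * U.V., best first."""
    scores = {doc_id: cosine(query, vec) * utility.get(doc_id, 1.0)
              for doc_id, vec in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

A document with a high cosine correlation but a low utility value can 
thus be ranked below a weaker match whose utility value is intact. 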

3. The Organization of the Experiments 
A. The Collection 

The SMART retrieval system provides the Cranfield 1400 
collection with 225 queries. This is a suitable collection for the 
experiments in that a maximum in updating of the utility values can be 
achieved. The query collection is split into an "updating collection" 
and a "test collection". The updating query collection is used to 
obtain utility values for all the retrieved documents while the test 
collection serves as a means for evaluating the influence of the retrieval 
algorithm mentioned in Section 2. 

A set of 17 randomly chosen queries will serve as test collection. 
A comparison will be made between the retrieval results of the 17 queries 
without and with the use of utility values. In principle 
208 queries were available to serve as updating collection. For practical 
reasons, however, a subset of 157 queries was actually used. 

In an on-line retrieval system, which will be simulated in these 
experiments, the number of feedback iterations is likely to be at least 
one. That means that relevancy decisions for the first iteration are 
available for 157 queries. These decisions will be used to 
determine utility values for the retrieved documents. For every query 
30 documents are retrieved, which requires in practice 30 relevancy 
decisions from each user. The number is chosen rather high in order to 
obtain a fair amount of updating and therefore probably more significant 
results. 

B. The Updating Strategies 

The most realistic way to implement an updating procedure is to 
do so dynamically. That is, each utility value changed after a search 
will be used in the retrieval algorithm and influences the retrievability 
of that specific document in later searches. 

In the SMART environment the updating collection is run as a 
batch, the utility values being assigned afterwards and applied for the 
first time while running the test collection. (For a more complete 
discussion about static and dynamic updating of values see [1].) Similar 
experiments [2] concerning the updating of weighted dictionaries have 
shown that dynamic updating is superior to static updating. If, for the 
experiments to be described in this report, the static approach 
turns out to be satisfactory, a dynamic test might be useful at 
a later time. 

Three different updating strategies are applied, namely 

I The Straight Updating 

For each query in the updating collection, the 30 retrieved 
documents are updated in accordance with the unmodified increment- 
decrement function mentioned. The utility values are increased if a 
retrieved document is relevant, and decreased otherwise. An 
arbitrary factor in the updating function which governs the stepsize 
of the increment or decrement is chosen equal to 8 (according to Sage, 
Ref. [3]). It is not a priori clear whether this value is optimal. 

II The Balanced Updating 

Since the average number of relevant retrieved documents for a 
query is approximately 5, an average of 25 documents will be decreased 
in utility value in strategy I. Thus for the whole collection the 
average utility value will become less than 1. Documents which are 
not updated have therefore a higher probability of being retrieved than 
updated ones since their utility value is still intact. 

To eliminate this effect the number of decreased documents 
is kept equal to the number of increased ones. Thus in strategy II, 
all relevant documents included in the 30 retrieved ones are 
increased, and an equal number of the highest ranked nonrelevant 
documents is decreased. 

III The Extended Balanced Updating 

A disadvantage of the balanced updating strategy II is that 
the number of updatings is much smaller than in the case of method I. 
In strategy I, 30 utility values per query are updated, whereas for 
strategy II this number is reduced to about 10 (5 values increased and 
5 decreased). 
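
The difference between strategies I and II lies only in which retrieved 
documents are selected for updating. The following Python sketch makes 
the selection explicit; the function name `docs_to_update` and the 
representation of the relevancy decisions as a rank-ordered list of 
booleans are assumptions made for the example. 

```python
def docs_to_update(relevance, balanced):
    """Select the ranks whose utility values are increased or decreased.

    relevance: list of booleans in rank order (index 0 is rank 1),
               True where the retrieved document was judged relevant
    balanced:  False for strategy I (straight updating),
               True for strategy II (balanced updating)
    Returns (increase_ranks, decrease_ranks) as lists of 1-based ranks.
    """
    increase = [rank for rank, rel in enumerate(relevance, 1) if rel]
    nonrelevant = [rank for rank, rel in enumerate(relevance, 1) if not rel]
    if balanced:
        # Strategy II: only as many nonrelevant documents as there are
        # relevant ones, taking the highest ranked nonrelevant first.
        decrease = nonrelevant[:len(increase)]
    else:
        # Strategy I: every retrieved nonrelevant document is decreased.
        decrease = nonrelevant
    return increase, decrease
```

For a query with relevant documents at ranks 1, 3, 4 and 7 among 30 
retrieved, strategy I decreases 26 documents while strategy II 
decreases only those at ranks 2, 5, 6 and 8. 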

In order to combine the benefits of the maximum number of 
updatings (30) per query while preventing a decrease in the average 
utility value, Method III, the extended balanced updating, is developed. 
In addition to the usage of all retrieved documents this method 
also provides an exact balancing. The sum of all the decrement steps is made 
equal for each query to the sum of all the increment steps. The 
decrement steps in particular will be chosen to be rank-dependent, 
that is, higher ranked nonrelevant documents are subject to a greater 
decrease in utility value than lower ranked ones. The rationale behind 
this is of course that high ranking nonrelevant documents should be 
suppressed. An extensive treatment of Method III is given in Appendix 2. 

After applying these three strategies to the retrieval results 
of the query updating collection, one obtains a list of utility values 
which will be used in the retrieval algorithm for the query test 
collection: 

Rf = cos(q,d) * U.V. 

4. The Results 

A. The Straight Updating 

The results obtained using Method I (Fig. 1) are not promising. 
A possible explanation is the fact that almost all utility values are 
decreased. The drop in average utility value promotes the retrieval 
of documents that were not updated and degrades system performance. 
The retrieval algorithm consists of the product of utility value and 
cosine correlation. For this reason documents whose utility value is 
still equal to 1 are ranked high, even when their actual correlation 
might be relatively low. 

B. The Balanced Updating 

The results obtained with Method II are considerably better 
(Fig. 2) than those of the unbalanced updating method. They show that 
the idea of balancing to keep the average utility value approximately 
equal to 1 is justified. Still a deterioration of system performance 
can be seen. 

C. The Extended Balanced Updating 

The results with this rather complicated algorithm (see 
Appendix 2) are better than those for both Method I and II (Fig. 3). 
They indicate that taking into account all the 30 available relevancy 
decisions per query is advantageous. In Method II the number of 
documents to be decreased in utility value is chosen equal to the 
number of documents to be increased. The stepsizes however are subject 
to the current size of the utility value, and therefore an exact 
balancing is not obtained. In Method III, however, the algorithm is 
devised such that for every query the sum of the increment steps is 
equal to the sum of the decrement steps, where all 30 retrieved 
documents are considered. 

Since the results with Method III are the best obtained, and 
since no self-evident new philosophy for a fourth method could be 
found, a more complete analysis of the Sage increment-decrement 
function was carried out. This function forms the basis of the iterative 
algorithm used in III, and as already mentioned, the constant C (see 
Appendix 1) which influences the stepsize has been set equal to 8 in 
all experiments. Decreasing this constant gives worse results, but 
increasing the value yields a remarkable system performance improvement 
(Fig. 4, 5, 6, and 7). A constant of approximately 17 turns out to be 
optimal in that the recall-precision curve for search results (Fig. 6) 
has been lifted maximally compared with the reference curve. 

It is a property of the algorithm that it is not particularly 
sensitive to changes in C. The starting value 8 was apparently much 
too small, that is, the updating step was too large (C occurs in the 
denominator). In the range 13 to 25, however, the retrieval results are 
not considerably affected by the change in C. 

A Student T test was carried out in order to verify the 
significance of the results obtained with Method III, C = 17. The 
outcome of the test (T statistic = 3.99 with 20 degrees of freedom) 
indicates that there is a negligible likelihood for the two sets of 
performance figures (reference versus Method III, C = 17) to have 
originated in the same distribution. 

5. Conclusion 

The experiments described in this report support unmistakably 
the usefulness of a new system parameter called utility value. The 
assignment of such a value to each document in the collection 
provides a means for improving system performance. It has been shown that 
a careful approach to the concept is required, since it is found that 
balancing (keeping the average utility value at 1) as well as the 
usage of heuristically determined constants are critical factors in 
applying a successful updating algorithm. 

It is clear that any document retirement policy based on 
document values can only be justified if such values have a realistic 
meaning, that is, applying them must improve system performance. 
This investigation has shown that the latter may be possible. 

[Fig. 6: Search Results Obtained with Method III, The Extended Balanced 
Updating, C = 17 (horizontal axis: Recall)] 

[Fig. 7 (horizontal axis: Recall)] 

Appendix 1: The Increment-decrement Function 

Initially, the utility value of each document is set equal to one. 
From that point the value is increased (or decreased) according to the 
relevancy decisions of the documents retrieved in response to each query. 

The specific increment-decrement function chosen is the one 
proposed by Sage [3], herein referred to as the "sine-function" because 
of its resemblance to the regular sine. If i is the retrieved document, 
define: 

$v_i$ = the utility value of document i 
(initially set equal to one) 

$v_i^*$ = the utility value of document i after updating 

$x_i = \arcsin(v_i - 1)$, the transposed utility value. 

Then 

$$v_i = 1 + \sin(x_i)$$ 

and similarly $v_i^*$ is calculated by 

$$v_i^* = 1 + \sin(x_i \pm \Delta x_i) \qquad (1)$$ 

where $\Delta x_i$ is a function of the old utility value, calculated as 
follows: 

$$\Delta x_i = \frac{\pi/2 - |x_i|}{C}$$ 

C is an arbitrary constant. 

$\Delta x_i$ is added in equation (1) if the retrieved document i is judged 
relevant by the user; it is subtracted in the utility value calculation if 
the corresponding document is judged non-relevant. 

Appendix 2: The Extended Balanced Updating 

Let q be a query which retrieves 30 documents and let $\delta_i$ 
be the increment or decrement stepsize of document i according to the 
sine function. According to equation (1) of Appendix 1, $\delta_i$ is 
given by 

$$\delta_i = |v_i^* - v_i| = |\sin x_i - \sin(x_i \pm \Delta x_i)|$$ 

The sum of all increment steps for query q can be given as 

$$S_r = \sum_{i \in \text{Rel Doc}} \delta_i$$ 

Since the number of non-relevant documents retrieved by a query 
is considerably larger (25 to 5) than the number of relevant ones, the 
decrement steps of the utility values for the nonrelevant documents 
should be made smaller. A rank-dependent decrement step is chosen 
such that low ranked nonrelevant documents will be decreased only 
slightly. Moreover, the function is adjusted by a free parameter such 
that the sum of all decrement steps, $S_n$, is going to be equal to 
$S_r$. $S_n$ is given by 

$$S_n = \sum_{j \in \text{Non-rel Doc}} \Delta_j$$ 

where $\Delta_j$, the actual decrement of the nonrelevant document j, is 
chosen to be a linear function $f_j$ of the retrieval rank $r_j$, according 
to the equation 

$$\Delta_j = \frac{\delta_j}{f_j} = \frac{29\,\delta_j}{D(r_j - 1) + (30 - r_j)}, \qquad r_j \in \text{Non-rel Doc} \qquad (1)$$ 

D is constant for one query but its value is changed for the 
next one because the number and ranks of relevant and nonrelevant 
documents change from query to query. The implementation of this method 
requires an iterative algorithm in order to find D for each query such 
that 

$$S_n = S_r \qquad (2a)$$ 

or 

$$\sum_{j \in \text{Non-rel Doc}} \Delta_j = \sum_{i \in \text{Rel Doc}} \delta_i \qquad (2b)$$ 

Consider first an illustration of equation (1) in order to render 
the method clearer. A nonrelevant document retrieved in rank 1 will be 
decreased by $\delta_j$, that is, the unmodified sine function ($f_1 = 1$). 
A nonrelevant document retrieved in rank 30 will be decreased by 
$\delta_j / D$ ($f_{30} = D$). All nonrelevant documents with ranks in 
between will be decreased subject to the linear function (1). 

The value of D on the other hand is determined by equation (2) 
and cannot be explicitly computed. An initial value for D must be 
chosen and equation (2) can be evaluated. D has to be changed in an 
iterative way until $|S_n - S_r| < \epsilon$, where $\epsilon$ is small. 

The procedure guarantees an almost exact balancing of the amount 
of increase and decrease of utility values per query. 
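
The balancing procedure can be sketched in Python as follows. This is 
an illustration under stated assumptions, not the program used in the 
experiments: the function names are invented for the example, and the 
search for D uses bisection, whereas the report only says that D is 
changed iteratively until the two sums agree within a tolerance. The 
sketch also generalizes the constants 29 and 30 to n - 1 and n for a 
query retrieving n documents (n must be at least 2). 

```python
import math


def sine_update(v, relevant, c=17.0):
    """Sine-function update of Appendix 1 for one utility value."""
    x = math.asin(max(-1.0, min(1.0, v - 1.0)))
    dx = (math.pi / 2 - abs(x)) / c
    return 1.0 + math.sin(x + dx if relevant else x - dx)


def rank_factor(rank, d, n):
    """Linear factor f: equal to 1 at rank 1 and to D at rank n."""
    return (d * (rank - 1) + (n - rank)) / (n - 1)


def extended_balanced_update(values, relevant, c=17.0, iterations=100):
    """Update the utility values of one query's retrieved documents.

    values:   utility values in rank order (index 0 is rank 1)
    relevant: booleans in the same order
    Returns (new_values, D).
    """
    n = len(values)
    inc = {i: sine_update(v, True, c) - v
           for i, v in enumerate(values) if relevant[i]}
    dec = {j: v - sine_update(v, False, c)
           for j, v in enumerate(values) if not relevant[j]}
    s_r = sum(inc.values())              # sum of increment steps

    def s_n(d):                          # sum of decrement steps for a given D
        return sum(step / rank_factor(j + 1, d, n) for j, step in dec.items())

    lo, hi = 1.0, 2.0
    if s_n(lo) <= s_r:
        d = lo                           # decrements already no larger than increments
    else:
        while s_n(hi) > s_r and hi < 1e6:
            hi *= 2.0                    # s_n decreases as D grows
        for _ in range(iterations):      # bisection on D
            mid = (lo + hi) / 2.0
            if s_n(mid) > s_r:
                lo = mid
            else:
                hi = mid
        d = hi

    new_values = list(values)
    for i, step in inc.items():
        new_values[i] += step
    for j, step in dec.items():
        new_values[j] -= step / rank_factor(j + 1, d, n)
    return new_values, d
```

When a balancing value of D exists, the sum of the utility values of 
the retrieved documents is left unchanged by the update, so the 
collection average stays at its previous level. 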

Following is an example illustrating how the document values are 
changed due to one query. 

30 documents are retrieved by a query. Some are found relevant. 
The document values of these 30 documents, after a number of previous 
updatings, will be changed according to the relevancy decision of the 
present query (see Table 1). 

$S_r$, the sum of the increases in document value for the relevant 
documents 320, 467, 322, and 321, is found to be 0.3760, using equation 
1 of Appendix 1. 

The $\Delta_j$'s are found iteratively by changing the value of D, 
such that $S_n$ is equal to $S_r$ to within a tolerance, which is 
0.05 in this case. 

For the variation of $S_n$ for different values of D, see 
Table 2. 

For the present query, D = 13 is used. 

Doc. No.    Rank    Old Doc. Value    New Doc. Value 

 320 R        1        1.0941            1.1818 
 476 R        3        0.9980            1.0920 
 322 R        4        0.9879            1.0820 
 321 R        7        0.9959            1.0898 
 107          2        0.9980            0.9325 
 478          5        1.0000            0.9645 
 479          6        0.9959            0.9653 
 734          8        0.9980            0.9740 
 190          9        0.9889            0.9672 
 452         10        0.9962            0.9763 
1251         11        0.9697            0.9518 
1163         12        0.9939            0.9770 
1149         13        0.9817            0.9661 
 422         14        0.9979            0.9832 
 255         15        1.0916            1.0886 
  47         16        1.0564            1.0438 
1162         17        0.9970            0.9847 
 837         18        0.9780            0.9665 
1209         19        1.0000            0.9889 
 150         20        0.9883            0.9778 
  34         21        1.0000            0.9899 
1254         22        1.0000            0.9903 
 979         23        0.9914            0.9821 
 626         24        0.9981            0.9892 
1235         25        1.0000            0.9914 
1370         26        0.9360            0.9281 
 538         27        0.9959            0.9879 
 425         28        0.9814            0.9738 
 818         29        1.0767            1.0686 
 363         30        1.0000            0.9928 

Alterations in Document Values (R marks a relevant document) 
Table 1 

D                    5         7         9         11        13 

$S_n$                0.8614    0.6735    0.5573    0.4775    0.4190 

$|S_n - S_r|$        0.4854    0.2975    0.1813    0.1015    0.0430 

The Variation of $S_n$ with respect to D 
Table 2 

Automatic Document Retirement Algorithms 
K. Sardana 

Abstract 

Some existing and proposed algorithms for automatic document 
retirement in a retrieval system are analyzed for their computational 
complexities and their effects on storage costs and the retrieval 
performance of the system. It is found that various retirement algorithms 
exhibit almost equivalent performance, especially at high recall; therefore, 
it is preferable to use those algorithms which provide low cost bounds. 

1. Introduction 

Two automatic document retirement algorithms have been proposed 
by Tai and Yang (referred to as TY in the sequel) [1] and Beall and 
Schnitzer (referred to as BS in the sequel) [2]. It is, however, not clear 
which algorithm should be used in practice, what costs are involved in 
executing the algorithms, what savings are obtained and at how much loss 
or gain in retrieval performance, etc. The idea of the present study 
is to look into these questions with special reference to the TY algorithm. 
Some more algorithms are proposed and overall comparisons are made between 
different algorithms when used with and without document vector modification 
(DVM) [4]. 

2. The Algorithms 

Both the TY and BS algorithms for document retirement are based on 
utilizing users' relevancy judgments in updating the documents. The 
algorithms that we propose also rely on users' relevancy judgments. We 
assume that relevancy judgments on retrieved documents for various queries 
are available as input to the retirement algorithms. The computational 
complexity of the algorithms will consist of the costs of modifying the 
documents, computing "use indices" for documents, etc. for the [illegible] 
usage of the algorithms and the cost of making retirement decisions. 

The general philosophy used here is to give the algorithms and 
their computational complexities. As we will not be able to give proofs 
of correctness of these heuristic algorithms, we resort to experimental 
methods in the next section to evaluate the performance of the methods in 
an experimental environment. Finally overall comparisons between algorithms 
are made. 

Criteria of Evaluation of Algorithms 

In determining the computational complexity of the algorithms, we 
assume the following: 

a) Asymptotic complexity will be used so that a machine independent 
cost analysis can be done. However, constants will be 
considered when finer decisions are involved. 

b) The model of the computing device is a random access machine 
which assumes that enough core memory is available for the 
entire program to fit in. 

c) The worst case computational complexity will be derived rather 
than expected complexity as the former is easier to derive. 

d) The basic steps in the computation to be considered for time 
complexity are additions (adds), multiplications (mults) and 
comparisons (comps). 

e) The cost criterion is uniform rather than logarithmic, i.e. 
a unit of each kind of operation will have some uniform 
cost regardless of the size of the numbers involved. 

A detailed discussion on choosing the above criteria may be 
found in [3]. 

Input: A set of documents {D} and a set of queries {Q} and users' 
relevancy judgments on {D} for {Q}. 

Output: A set of documents $\{D'\} \subseteq \{D\}$ to be retired to a 
secondary store such that future retrieval processing with the 
remaining {D} - {D'} documents using future queries is 
expected to be efficient overall, considering the space 
saved by retirement, the gain or loss in system performance 
and the cost of execution of the algorithm. 

Some Desirable Features of Algorithms 

To achieve the above mentioned goal of retiring documents, the 
following features (among others) of the algorithms seem to be desirable: 

a) Irretrievable documents or documents not relevant to any 
query should be retired. This, however, assumes that the 
indexing is perfect. 

b) The selection of various parameters for the algorithm should be 
straightforward in any practical implementation. 

c) The algorithm should not assume any unnecessary attributes of 
the document space, e.g. documents with the same average weights, etc. 

We note that each of the TY and BS algorithms lacks one or another 
of the above features. 

Methods 

In the following algorithms, we mention the step to retrieve the 
documents and to do DVM for a query. However, the costs of these steps 
are not considered in determining the computational complexity of a 
retirement algorithm as these stej are not part of tho retirement 
algorithm, per se. 

It may be argued that the analysis of computational complexity of
such isolated algorithms does not make much sense insofar as the
environment in which they are going to be used is known. Thus, for instance,
one may consider the effect of the complexity of a retirement algorithm on the
asymptotic complexity of the overall retrieval process. The cost of
matching a query with n documents, each document having on the average
m concepts, is O(mn), and the cost of selecting the top r documents for
showing to the user is O(n) using the median algorithm [5], implying the
overall retrieval process to be O(mn)/query. Since most of the additional
algorithms, like the retirement algorithms considered here, cost less than
O(mn)/query (see later), the asymptotic complexity of the overall retrieval
process does not change. It may, then, be concluded that the analysis
of such algorithms is not important. The viewpoint taken here is that in
document retrieval, where the cost of answering a query is quite high,
the reduction of costs of all sub-algorithms should be considered important.
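
As an illustration of the retrieval step whose cost is discussed above, the following Python sketch scores n documents against a query and picks the top r. The dictionary representation of vectors, the inner-product score and the name rank_documents are assumptions made for this example. heapq.nlargest is used in place of the linear-time median selection of [5], so the selection here is O(n log r) rather than O(n).

```python
import heapq


def rank_documents(query, docs, r):
    """Score every document against the query and return the top r ids.

    query: {concept: weight}; docs: {doc_id: {concept: weight}} (assumed layout).
    """
    scores = {}
    for doc_id, vec in docs.items():          # n documents
        # Inner product over the ~m concepts of this document.
        scores[doc_id] = sum(w * query.get(c, 0.0) for c, w in vec.items())
    # Not the median algorithm of [5]: a heap-based selection is used instead.
    return heapq.nlargest(r, scores, key=scores.get)
```

The matching loop touches about m concepts in each of n documents, which is the O(mn) term that dominates the per-query cost.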

The algorithms are expressed in pseudo-Algol for clarity and ease
in deriving the computational complexities. For convenience, we assume a
kind of macro facility available in the language with four keywords:
"defmacro X" defines a macro named X; the code between begmacro and
endmacro is the body of the macro named X (assuming "defmacro X" precedes
this). The body of the macro X is textually substituted for the call "macro X".

A1) Algorithm TY (Tai and Yang)

The TY algorithm [1] shown in Fig. 1(a) is basically the following,
suitably modified to work with a batch of queries (BATCHSIZE is the
number of queries in a batch and BATCHNUM is the number of batches to be
processed):

a) For each query q_ij in a batch, retrieve r documents.
Form the sets REL_ij, NREL_ij and BOT_ij consisting of the top RELTOP
relevant retrieved documents, at most NONLIM nonrelevant documents among
the top NONTOP retrieved, and the NUMBOT bottom-most ranking documents,
respectively.

b) After doing DVM (optional) using the queries in a batch,
multiply the weights of concepts of document vectors in REL_ij,
NREL_ij and BOT_ij by constants FR (> 1), FN (< 1) and FB (< 1)
respectively.

c) After every document vector multiplication by FN or FB, compute
the average weight AVGWT of the document and retire it if
AVGWT < CUTOFF, some chosen constant.

The idea of this algorithm is to reward the top relevant retrieved
documents and to penalize the top nonrelevant retrieved and the bottom-most
ranking documents for each query by respectively increasing or decreasing
all the weights of the concepts of the documents by the same factor at a
time. The information on the usefulness of a document is thus carried in
the weights of the vector itself. When the average weight per concept of
a document falls below a chosen threshold, meaning that the document has
been penalized more than it has been rewarded, the document is retired.
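
A minimal Python sketch of steps (b) and (c), assuming document vectors are stored as dictionaries from concept to weight and that retrieval and the formation of REL, NREL and BOT have already been done. The function name ty_update and the data layout are illustrative and are not taken from [1]; the strict comparison follows step (c).

```python
def ty_update(docs, rel, nrel, bot, fr, fn, fb, cutoff):
    """Apply one T&Y pass in place and return the set of retired ids.

    docs maps a document id to {concept: weight} (assumed layout).
    """
    retired = set()

    def scale(doc_id, factor):
        vec = docs[doc_id]
        for concept in vec:
            vec[concept] *= factor

    # Reward the top relevant retrieved documents.
    for d in rel:
        if d in docs:
            scale(d, fr)

    # Penalize nonrelevant and bottom-ranked documents, then test AVGWT.
    for group, factor in ((nrel, fn), (bot, fb)):
        for d in group:
            if d not in docs:          # already retired in this pass
                continue
            scale(d, factor)
            vec = docs[d]
            avgwt = sum(vec.values()) / len(vec) if vec else 0.0
            if avgwt < cutoff:
                retired.add(d)
                del docs[d]
    return retired
```

Every penalized document requires a pass over its roughly m weights, which is where the ~mr multiplications and additions per query in the cost analysis come from.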
Computational Complexity:
Let

n = total number of documents (roughly 400-1500) in the special
subject area of the query; thus for a query in Astrophysics,
n is the number of documents in this particular subcollection
rather than the whole Physics library.

q = total number of queries in the same area = O(n), say.

m = average number of concepts/document.

r = number of documents retrieved.

We also assume that

i) |REL_ij| + |NREL_ij| + |BOT_ij| ≤ r, where |X| denotes
the cardinality of set X. This is not an unreasonable
assumption to make and conforms with practice.

ii) |REL_ij| ≤ cr for some constant c (< 1).

Let us first determine the cost associated with the SETUP code of Fig. 1(b),
used in line (7) of algorithm TY of Fig. 1(a). Costs associated with lines
(8), (9) and (10) are constant. The cost of line (6) is ~r comps/query (~ stands
for approximately) because from the given ranked list of documents obtained
from line (4), this many comparisons may be needed to form the sets REL_ij and
NREL_ij.

Next we determine the cost of the T&Y code of Fig. 1(c) used in line (10)
of Fig. 1(a). Lines (4), (7) and (13) together cost ~m mults/document/query,
i.e. ~mr mults/query over all documents in sets REL, NREL and BOT. Similarly,
the cost of lines (8), (9), (14) and (15) over documents in sets NREL and BOT
is ~mr adds, ~r mults (actually divisions to calculate AVGWT) and ~r comps
per query.
Line #  Cost                 Code

 1                           procedure RETIRE comment Tai and Yang Algorithm
 2                           begin
 3                             for j ← 1 step 1 until BATCHNUM do
 4                             begin
 5                               REL = {φ}; NREL = {φ}; BOT = {φ};
 6                               for i ← 1 step 1 until BATCHSIZE do
 7      ~r comps/query             macro (SETUP);
 8                               comment optionally do DVM in next step using
                                   (REL_ij, q_ij) pairs, BATCHSIZE in number
 9                               macro (DVM);
10      ~mr+r mults/query        macro (T&Y);
        ~mr adds/query
        ~r comps/query
11                             end
12                           end

Algorithm TY, Tai and Yang's Algorithm

Fig. 1(a)

Line #  Cost                 Code

 1                           defmacro SETUP
 2                           begmacro
 3                           begin
 4                             retrieve r documents for ith query in jth batch
 5                               (call this query q_ij);
 6      ~r comps/query         form sets REL_ij, NREL_ij and BOT_ij;
 7                             comment • operation below denotes concatenation
 8                             REL = REL • REL_ij;
 9                             NREL = NREL • NREL_ij;
10                             BOT = BOT • BOT_ij;
11                             save (REL_ij, q_ij) pair on a list;
12                           end
13                           endmacro

Definition of Macro SETUP
Fig. 1(b)
Line #  Cost                 Code

 1                           defmacro T&Y
 2                           begmacro
 3                             for each document D in REL do
 4      ~mr mults/query          multiply all the weights of the concepts by a constant FR;
        (steps 4, 7, 13)
 5                             for each document D in NREL do
 6                             begin
 7                               multiply all the weights of the concepts by a constant FN;
 8      ~mr adds/query           calculate avg. wt., AVGWT, of the concepts;
        ~r mults/query
        (steps 8, 14)
 9      ~r comps/query           if AVGWT ≤ CUTOFF then retire this document D;
        (steps 9, 15)
10                             end
11                             for each document D in BOT do
12                             begin
13                               multiply all the weights of the concepts by a constant FB;
14                               calculate avg. wt., AVGWT, of the concepts;
15                               if AVGWT ≤ CUTOFF then retire this document D;
16                             end
17                           endmacro

Definition of Macro T&Y
Fig. 1(c)



Cost                 Code

                     defmacro DVM
                     begmacro
                       for k ← 1 step 1 until BATCHSIZE do
                       begin
                         for each document D in REL_kj do
~mr comps/query            for each concept C(l) belonging to D and q_kj do
~mr mults/query              W(l)' = W(l) + a*(BIG - W(l));
~2mr adds/query
                         comment W(l) and W(l)' are the weights of concept
                           C(l) before and after the operation; a and
                           BIG are constants defined by Brauen [4]
                       end
                     endmacro

Definition of Macro DVM
Fig. 1(d)
Summing up, the total cost of algorithm TY is

$$
\begin{aligned}
{\sim}r + {\sim}r &= {\sim}2r \ \text{comps/query} = O(r) \\
{\sim}mr + {\sim}r &= {\sim}(r + mr) \ \text{mults/query} = O(mr) \\
{\sim}mr &= {\sim}mr \ \text{adds/query} = O(mr)
\end{aligned}
\tag{1.1}
$$

or O(mr) operations/query.

A2) Algorithm BS (Beall and Schnitzer) 

The BS algorithm [2] shown in Fig. 2 works as follows:

a) For each query q_ij retrieve r documents. Using some chosen
parameters, form the sets REL1_ij, REL2_ij, NREL_ij and
BOTZERO_ij consisting of the top relevant retrieved, middle
ranking relevant retrieved, top nonrelevant retrieved and the
bottom-most ranking or zero correlating documents.

b) Using these sets, do a special kind of DVM by which concepts
common to query q_ij and each document in REL1_ij or REL2_ij
are increased and concepts common to query q_ij and each
document in NREL_ij or BOTZERO_ij are decreased in weight
at different rates.

c) After processing a number of queries in this fashion (even
though this is not explicitly stated in [2]), if a document
has (i) fewer than NUM concepts of weight greater than MIN1 and
(ii) an average weight less than MIN2, then
this document is retired. NUM, MIN1 and MIN2 are parameters
chosen for the algorithm.

By this algorithm, the information about the usefulness of a document
is carried, in a more or less ad hoc manner, in the concepts common to the
queries used for processing and the document. The retirement decision is
made at the end using a careful examination of each document vector.
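
The retirement test of step (c) can be sketched in Python as follows. This is only an illustration under stated assumptions: a document is represented by a plain list of its concept weights, the counter starts at zero, and the name bs_should_retire is hypothetical rather than taken from [2].

```python
def bs_should_retire(weights, num, min1, min2):
    """Return True if a document with these concept weights would be retired.

    weights: list of the document's concept weights (assumed representation).
    """
    # (i) number of concepts whose weight exceeds MIN1
    strong = sum(1 for w in weights if w > min1)
    # (ii) average concept weight; an empty vector is treated as weight 0
    avgwt = sum(weights) / len(weights) if weights else 0.0
    return strong < num and avgwt < min2
```

Both conditions must hold, so a document with a few heavy concepts survives even if its average weight is low.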
Cost                   Code

                       procedure RETIRE comment Beall and Schnitzer Algorithm
                       begin
                         for j ← 1 step 1 until BATCHNUM do
                         begin
                           for i ← 1 step 1 until BATCHSIZE do
                           begin
                             retrieve r documents for ith query in jth batch
                               (call it query q_ij);
~r comps/query               form sets REL1_ij, REL2_ij, NREL_ij, BOTZERO_ij;
                             save quintuple (REL1_ij, REL2_ij, NREL_ij, BOTZERO_ij, q_ij)
                               on a list;
                           end
                           comment do the DVM using quintuples saved, BATCHSIZE in number
                           for i ← 1 step 1 until BATCHSIZE do
                           begin
~3mr comps/query             for each document D in REL1_ij do
~3mr mults/query               if concept C(k) belongs to D and q_ij then
~3mr adds/query                  W(k)' = W(k) + a1*(BIG - W(k));
                             for each document D in REL2_ij do
                               if concept C(k) belongs to D and q_ij then
                                 W(k)' = W(k) + a2*(BIG - W(k));
                             for each document D in NREL_ij do
                               if concept C(k) belongs to D and q_ij then
                                 W(k)' = W(k) - (W(k)/N + 1);
~mr comps/query              for each document D in BOTZERO_ij do
~mr mults/query                if concept C(k) belongs to D and q_ij then
                                 W(k)' = W(k) * B;
                           end
                         end
                         comment make retirement decisions, n = no. of documents,
                           m = avg. no. of concepts in document
                         for i ← 1 step 1 until n do
                         begin
                           INUM ← 1; AVGWT ← 0;
                           for j ← 1 step 1 until m do
                           begin
~mn comps/q queries          if weight of concept C(j) in document i, W(C(j)_i)
~mn adds/q queries             > MIN1 then INUM ← INUM + 1;
~mn adds/q queries           AVGWT ← AVGWT + W(C(j)_i)
                           end
~n mults/q queries         AVGWT ← AVGWT/m
~2n comps/q queries        if INUM < NUM and AVGWT < MIN2 then
                             retire the document i;
                         end
                       end

Algorithm BS, Beall and Schnitzer's Document Retirement Algorithm

Fig. 2



In the BS algorithm, the special kind of DVM is an integral 
part of the algorithm; this is actually a drawback since the algorithm 
cannot be used unless DVM is also desired. The cost of this algorithm 
including DVM cost (from cost column in Fig. 2) is 

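$$
\begin{aligned}
{\sim}r + 3mr + mr + \frac{mn}{q} + \frac{2n}{q} &= {\sim}4mr + r + \frac{mn + 2n}{q} = O(mr) \ \text{comps/query} \\
{\sim}3mr + mr + \frac{n}{q} &= {\sim}4mr + \frac{n}{q} = O(mr) \ \text{mults/query} \\
{\sim}3mr + \frac{mn}{q} + \frac{mn}{q} &= {\sim}3mr + \frac{2mn}{q} = O(mr) \ \text{adds/query}
\end{aligned}
\tag{1.2}
$$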

To make a fair comparison of this algorithm with the previous one,
we include the cost of doing DVM along with the TY algorithm also. Then the
total cost of the TY algorithm (from equations (1.1)) and the DVM cost (from the
cost column in Fig. 1(d)) is

$$
\begin{aligned}
{\sim}2r + mr &= {\sim}mr + 2r = O(mr) \ \text{comps/query} \\
{\sim}mr + r + mr &= {\sim}2mr + r = O(mr) \ \text{mults/query} \\
{\sim}mr + mr &= {\sim}2mr = O(mr) \ \text{adds/query.}
\end{aligned}
\tag{1.3}
$$

Thus the asymptotic time complexity of the BS algorithm, considering the
DVM cost, is O(mr), the same as that of the TY algorithm. But considering the
constants in equations (1.2) and (1.3), the BS algorithm is the costlier of
the two.

A3) Algorithm S1

The algorithm S1 (Figs. 3(a), 3(b)) is a variant of the TY algorithm and
is described below.

a) Initialize the value of the separately assigned storage location
USENDX of each document to INIT.
Line #  Cost                     Code

 1                               procedure RETIRE comment variation of TY algorithm, S1 & S2
 2                               begin
 3                                 for i ← 1 step 1 until n do USENDX(i) = INIT;
 4                                 comment n = number of documents
 5                                 for j ← 1 step 1 until BATCHNUM do
 6                                 begin
 7                                   REL = {φ}; NREL = {φ}; BOT = {φ};
 8                                   for i ← 1 step 1 until BATCHSIZE do
 9      ~r comps/query                 macro (SETUP);
10                                   comment optionally do DVM in next step using
11                                     (REL_ij, q_ij) pairs, BATCHSIZE in number
12                                   macro (DVM);
13      Alg. S1: ~r mults/query      if algorithm S1 is desired then macro (S1)
        Alg. S2: ~r adds/query         else macro (S2);
14                                 end
15                                 for i ← 1 step 1 until n do
16      ~2n/q comps/query            if (USENDX(i) ≤ CUTOFF or USENDX(i) = INIT) then
17                                     retire the document i;
18                               end

Algorithms S1 and S2
(depending upon the algorithm desired, the corresponding named macro
is expanded in line 13 above)

Fig. 3(a), 4(a)



Line #  Cost               Code

 1                         defmacro S1 comment for algorithm S1
 2                         begmacro
 3                           for each document i in REL do
 4                             USENDX(i) ← USENDX(i) * FR;
 5                           for each document i in NREL do
 6      ~r mults/query         USENDX(i) ← USENDX(i) * FN;
 7                           for each document i in BOT do
 8                             USENDX(i) ← USENDX(i) * FB;
 9                         endmacro

Definition of Macro S1 for Use in Algorithm S1

Fig. 3(b)
b) Retrieve r documents for query q_ij. Form the sets REL_ij,
NREL_ij and BOT_ij just as in the TY algorithm.

c) After doing the DVM (optional) using the queries in a batch,
multiply the USENDX of a document by FR (> 1), FN (< 1) or
FB (< 1) according as this document appears in the set REL_ij,
NREL_ij or BOT_ij respectively.

d) After processing BATCHNUM batches of queries
using steps (b) and (c) repeatedly, if a document's USENDX = INIT
or USENDX ≤ CUTOFF, some chosen parameter, then this document
is retired. Go to step (a).

Note that here DVM does not directly take part in retirement decisions
as in the TY algorithm, but does so indirectly by affecting which documents get
retrieved and thus their USENDX values. The main difference
between the TY algorithm and the present one is this: the former uses the AVGWT
of a document as its usefulness index and, therefore, since
the document vectors are modified by DVM and the multipliers FR, FN or FB,
AVGWT needs to be computed for each document vector before making a
retirement decision. The latter algorithm uses a location USENDX for each
document, and its value is modified by the multipliers FR, FN or FB but
not by DVM directly. The retirement decision is based on USENDX values.

Another difference is that, as shown, the TY algorithm makes
retirement decisions after processing every batch of queries (5 used here)
while the S1 algorithm does so after processing a set of queries (125 used
here). But both algorithms may be adjusted to any retirement frequency, in
which case the asymptotic cost of the TY algorithm does not change while the cost
of the S1 algorithm may approach O(n) when a retirement decision is made after
processing every query.
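
The bookkeeping of steps (c) and (d) can be sketched in Python as below. The dictionary holding one USENDX per document and the names s1_update and retire_by_usendx are assumptions made for illustration. Note that the test USENDX = INIT is an exact comparison: with floating-point multipliers, a document whose rewards and penalties cancel algebraically may not return to exactly INIT.

```python
def s1_update(usendx, rel, nrel, bot, fr, fn, fb):
    """Multiply the use index of each listed document, one factor per appearance."""
    for group, factor in ((rel, fr), (nrel, fn), (bot, fb)):
        for d in group:
            usendx[d] *= factor


def retire_by_usendx(usendx, init, cutoff):
    """Return the sorted ids whose USENDX is at most CUTOFF or still equal to INIT."""
    return sorted(d for d, u in usendx.items() if u <= cutoff or u == init)
```

Each update touches one number per retrieved document instead of a whole vector, which is the source of the factor-of-m saving over the TY algorithm.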
Computational Complexity:

As shown in the cost column of Fig. 3(a), the cost of this algorithm is

$$
\begin{aligned}
&{\sim}r \ \text{mults/query} = O(r) \ \text{mults/query} \\
&{\sim}r + \frac{{\sim}2n}{q} \ \text{comps/query} = O(r) \ \text{comps/query}
\end{aligned}
\tag{1.4}
$$

or O(r) operations/query since q = O(n).

This algorithm is thus asymptotically better than the TY algorithm in
time complexity by a factor of m, the average number of concepts in a
document vector. However, the space complexity has been increased by
O(n), since probably one computer word or halfword USENDX location is used for
each of the n documents. This is reasonable considering that the space
required is on a cheap off-line device while the time saved decreases
the important on-line response time.

A4) Algorithm S2

This algorithm (Figs. 4(a), 4(b)), another variant of the TY algorithm,
resembles algorithm S1 closely. Here in step (b) the USENDX of a document
is increased or decreased by constant values by additions or subtractions
rather than by multiplications (as is done in algorithm S1) whenever a
document ends up in the set REL_ij or NREL_ij or BOT_ij.

Computational Complexity:

From the cost column of Fig. 4(a), the time complexity of this
algorithm is

$$
\begin{aligned}
&{\sim}r + \frac{{\sim}2n}{q} \ \text{comps/query} = O(r) \ \text{comps/query} \\
&{\sim}r \ \text{adds/query} = O(r) \ \text{adds/query}
\end{aligned}
\tag{1.5}
$$
Line #  Cost               Code

13.1                       defmacro S2 comment this macro is for algorithm S2
13.2                       begmacro
13.3                         for each document i in REL do
13.4                           USENDX(i) ← USENDX(i) + FREL;
13.5                         for each document i in NREL do
13.6    ~r adds/query          USENDX(i) ← USENDX(i) + FNREL;
13.7                         for each document i in BOT do
13.8                           USENDX(i) ← USENDX(i) + FBOT;
13.9                       endmacro

Definition of Macro S2 for Use in Algorithm S2
Fig. 4(b)
or O(r) operations/query.

Remarks similar to those for algorithm S1 apply here also.

A5) Algorithm S3 

Algorithm S3, Fig. 5, differs from algorithm S2 in step (b) in two respects:

i) Instead of assigning a fixed positive USENDX to a document
appearing in set REL_ij, algorithm S3 assigns a variable
index 1/p depending upon the number p of relevant retrieved
documents (i.e. p = |REL_ij|) for a query. The idea is to
assume that every relevant retrieved document is useful
in satisfying 1/p-th of the query. However, for nonrelevant
documents in set NREL_ij, a constant negative use index
is assigned.

ii) The bottom set of documents, BOT_ij, is not considered, in the
hope that the REL_ij and NREL_ij sets are enough for determining the
use indices of documents.

Computational Complexity:

The time and space complexity of this algorithm is the same as
that of algorithm S2 and is given by equations (1.5).
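
The per-query update of algorithm S3 can be sketched in Python as follows, assuming the use indices are kept in a dictionary and the sets REL and NREL for the current query are already known. The name s3_update is hypothetical.

```python
def s3_update(usendx, rel, nrel, fnrel):
    """Add 1/p to each relevant retrieved document and FNREL to each nonrelevant one."""
    p = len(rel)
    if p != 0:                      # no reward is given when nothing relevant is retrieved
        for d in rel:
            usendx[d] += 1.0 / p
    for d in nrel:
        usendx[d] += fnrel          # FNREL is a negative constant
```

With this weighting, the relevant documents of one query share a total credit of 1, so a query answered by a single relevant document rewards it more than a query answered by many.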



3. Experimental Results 

Since the performance of document retirement algorithms (like 
most information retrieval algorithms) depends upon the unpredictable 
future queries, it is meaningless to talk about a proof of correctness 
of such algorithms. Therefore, we resort to experimental methods to 
evaluate the success of these algorithms. 
Line #  Cost                 Code

 1                           procedure RETIRE comment for algorithm S3
 2                           begin
 3                             for i ← 1 step 1 until n do USENDX(i) = 0;
 4                             comment n = number of documents
 5                             for j ← 1 step 1 until BATCHNUM do
 6                             begin
 7                               for i ← 1 step 1 until BATCHSIZE do
 8                               begin
 9                                 retrieve r documents for ith query in jth batch
10                                   (call this query q_ij);
11      ~r comps/query             form sets REL_ij and NREL_ij;
12                                 comment let |REL_ij| = p
12.5                               if p ≠ 0 then
13                                   for each document k in REL_ij do
14      ~r adds/query                  USENDX(k) ← USENDX(k) + 1/p;
15                                 for each document k in NREL_ij do
16                                   USENDX(k) ← USENDX(k) + FNREL;
17                                 save (REL_ij, q_ij) pair;
18                               end
19                               macro (DVM);
20                             end
21                             for i ← 1 step 1 until n do
22      ~2n/q comps/query        if (USENDX(i) ≤ CUTOFF or USENDX(i) = 0) then
23                                 retire the document i;
24                           end

Algorithm S3
Fig. 5
The description of the test environment used is as follows:

Retrieval System                 SMART
Testing Method                   Test and Control groups [1]
Document Collection              CRN4S DOCS (424 documents)
Document Cluster Collection      CRN4S TREE1
Query Collection                 CRN4S QUESTS (155 queries)
  a) Number of Test Queries      125
  b) Number of Control Queries   30 (15 Similar + 15 Nonsimilar
                                 to Test Queries)


A) Testing Procedure 

In our testing method, we compare the performance of 30 Control Queries
on the original document collection and on the document collection modified by
retirement using 125 Test Queries. This simulates the practical on-line
situation. We mention that TY [1] and BS [2] have evaluated their algorithms
by comparing the performance of all 155 queries on the original document
collection and on the document collection modified by retirement using the same
155 queries. They assume a fixed set of queries to be used by the users,
and better results are and would be obtained for this situation which is, however,
not a generally realistic one.

B) Choosing Parameters for Retirement Algorithms 

One of the parameters to be decided for document retirement is the
time span after which to retire. The retirement may be done after processing
a fixed number of queries or after a fixed time. It seems that the values
of these parameters depend upon the usage characteristics of individual
subcollections within a collection. Moreover, their optimum values would need
to be determined for each subcollection experimentally. The experiments
conducted here make retirement decisions after processing a batch of 5 queries
for the TY algorithm and after processing the total number of 125 test
queries for the others.

A discussion on choosing the other parameters follows. 

a) Algorithms TY and S1

We consider a procedure for choosing the positive parameters
FR (> 1), FN (< 1) and FB (< 1) used in algorithms TY and S1. Suppose the initial
value of USENDX assigned to each document is INIT; in the case of algorithm TY, INIT
is the average weight of each document, assumed to be constant over all
documents.

Notation 3.1: Let count l/m/n of a document, for nonnegative integers
l, m and n, stand for l, m and n appearances of the document in classes REL,
NREL and BOT respectively.

Then the final value of USENDX for this document is:

$$ \mathrm{INIT} \cdot (FR)^{l} \cdot (FN)^{m} \cdot (FB)^{n}. \tag{3.1} $$

Definition 3.1: A count l_1/m_1/n_1 is equivalent to (less than)
a count l_2/m_2/n_2 of a document if the final value of USENDX obtained by
l_1/m_1/n_1 is algebraically equal to (less than) the final value of USENDX
obtained by l_2/m_2/n_2, i.e.

$$ (FR)^{l_1} (FN)^{m_1} (FB)^{n_1} = (<)\ (FR)^{l_2} (FN)^{m_2} (FB)^{n_2}. $$

Lemma 3.1: Suppose that for algorithm TY without DVM and for
algorithm S1 with or without DVM, a appearances of a document in class REL
are offset by b appearances of the same document in class NREL or by c
appearances of the same document in class BOT, i.e.:

$$ a/b/0 \equiv 0/0/0 \equiv a/0/c, \tag{3.2} $$

then

$$ FN = (FR)^{-a/b} \quad \text{and} \quad FB = (FR)^{-a/c}. \tag{3.3} $$

Moreover, to retire all the documents with counts l/m/n or less,
the retirement cutoff for USENDX may be taken as

$$ \mathrm{CUTOFF} = \mathrm{INIT} \cdot (FR)^{l} (FN)^{m} (FB)^{n} = \mathrm{INIT} \cdot (FR)^{\,l - \frac{a}{b}m - \frac{a}{c}n}. \tag{3.4} $$

Proof:

$$ a/b/0 \equiv 0/0/0 \ \Rightarrow\ (FR)^{a} (FN)^{b} = 1 \ \Rightarrow\ FN = (FR)^{-a/b} $$

$$ a/0/c \equiv 0/0/0 \ \Rightarrow\ (FR)^{a} (FB)^{c} = 1 \ \Rightarrow\ FB = (FR)^{-a/c} $$

The rest of the lemma is obvious.

Note that the above lemma obviously does not apply to algorithm
TY used with DVM, since the average weight of a document is also changed
by the DVM process and equation (3.1) may not hold.

Now the problem of choosing the parameters FR, FN, FB and CUTOFF boils
down to the comparatively easy problem of choosing reasonable values of a, b, c
and l, m, n and any initial values INIT and FR, both > 1.

Example:

Let a = 1, b = 2, c = 8, i.e. 1/2/0 ≡ 0/0/0 ≡ 1/0/8. Also assume
l = 0, m = 3, n = 0, i.e. we decide to retire all documents with counts
≤ 0/3/0.

Choose INIT = 12
       FR = 1.56

Then equation (3.3) gives

$$ FN = (FR)^{-a/b} = (1.56)^{-1/2} = 0.8007 \quad \text{and} \quad FB = (FR)^{-a/c} = (1.56)^{-1/8} = 0.9459 $$

and equation (3.4) gives

$$ \mathrm{CUTOFF} = \mathrm{INIT} \cdot (FR)^{\,l - \frac{a}{b}m - \frac{a}{c}n} = 12 \cdot (1.56)^{-3/2} \approx 6.17 $$

We also note that there seems to be no straightforward way to choose the
parameters RELTOP, NONTOP, NONLIM and NUMBOT. However, the experiments done
here tended to support that with NUMBOT = 10, the results were a little better
than with NUMBOT = 0. The following seemingly reasonable values of these
parameters, as used by Tai and Yang, were mostly used throughout the experiments
done here:

RELTOP = 30, NONTOP = 6, NONLIM = 5, NUMBOT = 10.
b) Algorithm S2

Here, since the USENDX's are changed by additions, for a document
with count l/m/n the final USENDX value is

$$ \mathrm{INIT} + (l \cdot FREL + m \cdot FNREL + n \cdot FBOT) \tag{3.5} $$

where FREL, FNREL and FBOT are the parameters used here corresponding to FR, FN
and FB used in algorithms TY and S1.

With this difference, the analysis is similar to that done previously.

Example:

Choose a, b, c such that 1/5/0 ≡ 0/0/0 ≡ 1/0/20

INIT = 0.0
FREL = 0.5

Then

$$ FNREL = -FREL \cdot \frac{a}{b} = -0.5 \cdot \frac{1}{5} = -0.1, \qquad FBOT = -FREL \cdot \frac{a}{c} = -0.5 \cdot \frac{1}{20} = -0.025 \tag{3.6} $$

To retire documents with count ≤ 0/0/10:

CUTOFF = INIT + 10 * FBOT = 0.0 + 10 * (-0.025) = -0.25
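
The additive counterpart of the parameter choice can be sketched in the same way. The function names below are hypothetical; the arithmetic follows the final-value expression (3.5) and the example values above.

```python
def additive_parameters(frel, a, b, c):
    """FNREL and FBOT such that a/b/0 and a/0/c leave USENDX unchanged."""
    fnrel = -frel * a / b
    fbot = -frel * a / c
    return fnrel, fbot


def additive_cutoff(init, frel, fnrel, fbot, l, m, n):
    """Final USENDX of a document with count l/m/n, used as the retirement cutoff."""
    return init + l * frel + m * fnrel + n * fbot


fnrel, fbot = additive_parameters(0.5, 1, 5, 20)
cutoff = additive_cutoff(0.0, 0.5, fnrel, fbot, 0, 0, 10)
```

Here one appearance in REL is cancelled by five appearances in NREL or by twenty in BOT, mirroring the equivalence 1/5/0 ≡ 0/0/0 ≡ 1/0/20.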

c) Algorithm S3

In this algorithm, documents appearing in the BOT set are not considered.
To retire documents with count ≤ 0/10/-, choosing a "reasonable" value of
FNREL = -1/20, use

CUTOFF = 10 * FNREL = -10/20 = -0.5.

We further note the following:

1) In addition to retiring documents with counts l/m/n or less, it
is probably desirable to retire documents whose USENDX does not
change at all. Such documents either have a count of 0/0/0,
meaning that they are never retrieved, or have some count
l_i/m_i/n_i such that

$$ (FR)^{l_i} (FN)^{m_i} (FB)^{n_i} = 1 \quad \text{for the TY and S1 algorithms} $$

or (3.7)

$$ l_i \cdot FREL + m_i \cdot FNREL + n_i \cdot FBOT = 0 \quad \text{for the S2 and S3 algorithms.} $$

The latter category of documents may be useful although the items
are still retired; however, the number of such documents is
expected to be small. Since there seems to be no particular
reason why this "latter category" of documents should be retired,
we may modify the algorithms to prevent their retirement. There
are two ways:



i) Choose FR, FN and FB (or FREL, FNREL and FBOT) in such
a way that the probability of (3.7) being satisfied
for some nontrivial count l_i/m_i/n_i is close to zero.
For example, in (3.6) choose FNREL = -0.1 + ε_1, FBOT =
-0.025 + ε_2 with FREL = 0.5 for some small values of ε_1
and ε_2.

ii) Depending upon the algorithm being used, choose the parameters
FR > FN > FB > 1, INIT = 1 and CUTOFF > 1 (or FREL > FNREL >
FBOT > 0, INIT = 0 and CUTOFF > 0). Since the USENDX of a
document may only increase, (3.7) will never be satisfied for
any nontrivial count l_i/m_i/n_i. This means that the documents
retired based on the criterion of unchanged USENDX can only
have count 0/0/0.*

In the experiments done here neither of the above approaches is
used. Thus, for instance, no particular attention is given
to choosing the parameters in the sense of approach (i) above.
However, it is found in the experiments that about 2% of the
documents are in this "latter category" and their retirement
does not actually degrade the performance significantly.

In practice, the second approach should probably be used;
it has another advantage as well: FREL, FNREL, FBOT and
CUTOFF may be chosen to be positive integers and, since the
USENDX's are reinitialized after processing a group of
queries, the magnitude of a USENDX would fit in a halfword.
Thus the storage required for USENDX's is reduced by half.

*This approach was actually considered at the time the project was originally
conceived. At that time it was felt that it might be desirable to use all
the past information on the usefulness of a document in making future
retirement decisions. So after making every retirement decision, the
USENDX's of documents should not be reinitialized. In the case of the
approach being considered, this means that all the USENDX's would grow
without bound, requiring unbounded storage. Therefore, this approach was
abandoned and the algorithms were programmed as shown in the text. Presently,
however, it seems that reinitializing the USENDX's after every retirement
decision step may actually help in keeping the document space more up-to-date.
(In practice, the optimum frequency of reinitialization of USENDX's may have
to be determined experimentally.)

2) If it is desired to retire a fixed percentage of documents, e.g.
to maintain a fixed number of active documents in the store,
then the retirement cutoff may be determined as follows:

If d documents are to be retained, then from the USENDX's
of the n documents determine the (d + 1)st largest USENDX by the
median algorithm [5] in O(n) steps. Take this value to be
CUTOFF (note that some care may have to be taken to retain d documents
in case the dth and (d + 1)st largest USENDX's are the same).

This procedure takes O(n) comps/q queries, or O(n/q) comps/query,
or O(c) comps/query for some constant c because we have assumed
q = O(n). Thus, the asymptotic cost of the algorithm does not
change with this technique.

We also note that for a fixed retirement, algorithms S1 and
S2 yield the same result; thus it is preferable to use the
cheaper algorithm S2 in this regard.
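
A small Python sketch of the fixed-retirement cutoff follows. It is a sketch under assumptions: heapq.nlargest stands in for the linear-time median algorithm of [5], so it runs in O(n log d) rather than O(n), and the tie-breaking rule in documents_to_retain (by document id) is an arbitrary choice made here to handle the equal-USENDX case mentioned above.

```python
import heapq


def cutoff_for_retaining(values, d):
    """Return the (d + 1)st largest USENDX, or None when d >= number of documents."""
    if d >= len(values):
        return None
    return heapq.nlargest(d + 1, values)[-1]


def documents_to_retain(usendx, d):
    """Pick exactly d document ids with the largest USENDX; ties are broken by id."""
    ranked = sorted(usendx, key=lambda k: (-usendx[k], k))
    return set(ranked[:d])
```

When the dth and (d + 1)st largest values coincide, comparing against the cutoff alone cannot retain exactly d documents, which is why the second function ranks the documents explicitly.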

C) P-R Curves 

The Precision-Recall (P-R) curves obtained for the various algorithms
without and with DVM are shown in Figs. 6-9. The BS algorithm's performance
curves are not obtained since this algorithm is quite costly and its results
are not expected to be better than those of the TY algorithm. For each algorithm
the P-R curves are given for the original document collection and the collection
obtained after various percentages of retirement of documents by the respective
algorithms. Observations from these curves are briefly summarized below:

i) The performance degrades almost monotonically as the retirement
rate is increased for any of the algorithms, with or without
DVM.

ii) The rate of degradation of performance with the increase in
retirement rate, with or without DVM, is the smallest for the TY
algorithm up to 0.5 recall, while for recall beyond 0.5 all the
algorithms seem to fare equally badly. Typically, for about
18% document retirement without DVM at the 0.5 recall level, the
losses in precision for algorithms TY, S1, S2 and S3 are
0.050, 0.055, 0.072 and 0.070 respectively.
iii) The only difference in retirement methods with and without
DVM is that all the P-R curves with DVM are higher than
the corresponding curves without DVM, but relative to each
individual group, the performance is similar in both cases.

iv) For retirement up to 10% or so, the performance is much like
that with the original collection for the TY algorithm. The
same is true for the other algorithms up to retirement of about
6%. However, the retirement of 6% or so for algorithms S1,
S2 and S3 is obtained by retiring only the documents whose
USENDX remains stationary after processing 125 queries, since
CUTOFF values of 0.000 for the S1 algorithm and of -1.000 for the S2
and S3 algorithms were so chosen that no USENDX of documents
would be below these values.

This means that for collections like CRN1400, which contains
around 30% documents not relevant to any query, since USENDX
for such documents has a high probability of remaining
invariant, such idle documents would be retired by any
of the algorithms S1, S2 or S3 without affecting the retrieval
performance. The same may not hold for the TY algorithm.

v) Only the S3 algorithm was tried for document retirement as high as
64%, and that too only without DVM (Fig. 9(a)). The performance
gets progressively worse compared to the performance of the
original collection. It is then expected that the same would be
true of the other algorithms also, even though it is not very clear
by how much the TY algorithm will deteriorate at such a
high retirement rate.
Overall Comparison of Various Algorithms 

Figs. 10 and 11 give the overall comparison of various algorithms. 
Fig. 10 gives a sample comparison between the various algorithms at
17-19% retirement without DVM and at 26-28% retirement with DVM. It is found that
most of the curves are clustered together beyond 0.5 recall while some
differences are apparent at low recall, especially from the inserted
recall-precision tables. Except for retirement of less than 10%, when all the
algorithms give performance close to the original, the TY algorithm seems to
give overall better P-R curves. For this reason and because the running
time of the TY algorithm is O(mr) operations/query compared to O(r) operations/
query for algorithms S1, S2 and S3, the latter algorithms are compared with
the TY algorithm closely at different retirement rates.

The results are tabulated in Fig. 11 for with and without DSM cases 
separately. The following observations are made: 

i) Upto about 7% retiranent, algorithm S3, S2 and SI give results 
equivalent to the original document collection and/or better 
than TY algorithm. 

ii) For 8 - 17% retirement algorithms SI, S2 and S3 give performance 
equivalent to or a little worse than TY algorithm. 

iii) For 18 - 28% retirement, algorithms SI, S2 and S3 give performance 
starting from equivalent to worse than TY algorithm. 

iv) Algorithm SI seems to do overall better than S2 and S3 algorithms. 

Considering that algorithms S1, S2 and S3 cost O(r) operations/query
compared to O(mr) operations/query for the TY algorithm, and that the former
algorithms give performance almost equivalent to the latter, especially at
high recall, it may safely be concluded that the cheaper algorithms should
be used for document retirement.
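
To make the cost argument concrete, the following minimal sketch compares
operation counts per query under the two stated bounds. It assumes that the
constants hidden in the O-notation are 1 and that r and m keep the meaning
they have in the text. The function names and the sample value r = 30 are
illustrative assumptions, not figures from the experiments.

```python
def ops_per_query_ty(m: int, r: int) -> int:
    """Operation count for the TY algorithm, taking the O(mr) bound with constant 1."""
    return m * r


def ops_per_query_simple(r: int) -> int:
    """Operation count for S1, S2 or S3, taking the O(r) bound with constant 1."""
    return r


def speedup(m: int, r: int) -> float:
    """Ratio of TY cost to S1/S2/S3 cost under the assumptions above."""
    return ops_per_query_ty(m, r) / ops_per_query_simple(r)


if __name__ == "__main__":
    m = 200  # average value of m quoted in the conclusions of this paper
    r = 30   # hypothetical value, chosen only for illustration
    print(ops_per_query_ty(m, r), ops_per_query_simple(r), speedup(m, r))
```

Under these assumptions the ratio equals m regardless of r, so the cheaper
algorithms save a factor of about m in running time per query.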

5. Summary and Conclusions

This paper has attempted to study the behavior of various automatic
document retirement algorithms, including the previously known ones and some
newly proposed algorithms. The cost of using these algorithms is analyzed, and
the effects of these algorithms on the storage costs and performance of the
retrieval system are examined. The conclusions are summarized below:

i) Without cost considerations, algorithm TY is probably the best of
the algorithms tried. This algorithm costs O(mr) operations/query.

ii) By allocating a computer word or so for USENDX for each document,
i.e. at the expense of O(n) additional space (this is, however,
cheap off-line device storage), the running cost using algorithm
S1 is O(r) operations/query, with performance very much like
algorithm TY, especially at high recall. Algorithm S1 is
m (on the average m = 200) times faster than algorithm TY in
running time.

iii) The performance of algorithm S1 is like that of algorithm TY for up to
about 18% retirement, and even then S1 gets worse mostly at low recall
(≤ 0.5).

iv) With algorithms TY and S1, the performance with up to 13%
retirement is pretty close to the performance of the original
document collection.

v) Even with algorithms S2 and S3, which require O(n) additional
off-line space for the USENDX's of documents and cost the least
in time complexity (i.e. about r comparisons/query), the performance
with up to 10 - 12% retirement is very close to the original
performance.

vi) With algorithms S1, S2 and S3, a retirement of 5 - 10% that
retires all the documents whose USENDX's do not change after
processing q (≈ 125 or so) queries keeps the system performance
almost exactly equal to the original performance. Note: as
said at the end of Section 2(b), such retirement surely retires
documents which never get retrieved nor end up in the BOT set.
The documents which are not relevant to any queries also
have a very good chance of retirement.

Implementing the retirement of such documents with the TY algorithm
is difficult since the initial AVGWT of each document is different,
and even if these values are calculated for each document,
they need to be stored; this requires O(n) space, which the
TY algorithm mainly tries to avoid.

vii) In view of the previous point, it is expected that with document
collections like CRN1400, in which almost 30% of the documents are
not relevant to any queries at all, the algorithms S1, S2 and
S3 will retire at least these 30% of the documents or so, without
any loss in performance. (In practice, the number of such
retired documents would be less than 30% since some of these
documents will have counts l_i/m_i/n_i such that final USENDX ≠ INIT.)
This may not be true of the TY algorithm. In such situations,
algorithms S1, S2 and S3 are expected to perform better than
the TY algorithm.

viii) For all the algorithms tried, the performance gets almost monotonically
worse with the increase in the retirement rate.

ix) The BS algorithm, which was not tried, is expected to perform no
better than the TY algorithm, while it is the most expensive (in
time complexity) of the algorithms considered.

x) The rate of degradation of performance with the increase in
retirement rate, with or without DVM, is the smallest for the TY
algorithm up to 0.5 recall, while for recall beyond 0.5 all
the algorithms seem to fare equally badly.

xi) The only difference between the retirement methods with and without DVM
is that all the performance curves with DVM are higher than
the corresponding curves without DVM. However, relative to
each individual group, the performance is similar in both
cases.

xii) Overall, the kinds of algorithms to be used against the
possible percentages of retirement may be stated as follows:

Retirement %     Algorithm Recommended (in order)

[illegible]      S3, S2, S1, TY
[illegible]      S2, S3, S1, TY
[illegible]      S1, S2, S3, TY

xiii) To summarize in one sentence, the main conclusion of this
work is that various document retirement algorithms obtain
almost equivalent performance, especially at high recall;
so it is preferable to use the cheaper algorithms.
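
The retirement rule of conclusion vi), which retires every document whose
USENDX is unchanged after a batch of q queries, might be sketched as follows.
This is only an illustration under stated assumptions: the value of INIT, the
`Document` class, and the placeholder update in `process_query` (adding a
fixed `delta`) are hypothetical and do not reproduce the USENDX update rules
of algorithms S1, S2 or S3.

```python
from dataclasses import dataclass

INIT = 0  # hypothetical initial USENDX value; the paper's actual INIT is not restated here


@dataclass
class Document:
    doc_id: int
    usendx: int = INIT  # one word of extra storage per document, i.e. O(n) overall


def process_query(docs, touched_ids, delta=1):
    """Apply a placeholder USENDX update to every document touched by one query.

    Adding `delta` is only a stand-in for the real update rule, so that
    changed and unchanged documents can be told apart.
    """
    for doc_id in touched_ids:
        docs[doc_id].usendx += delta


def retire_unchanged(docs):
    """Split the collection into (kept, retired) after a batch of queries.

    A document is retired when its USENDX still equals INIT.
    """
    kept = {i: d for i, d in docs.items() if d.usendx != INIT}
    retired = {i: d for i, d in docs.items() if d.usendx == INIT}
    return kept, retired


if __name__ == "__main__":
    docs = {i: Document(i) for i in range(10)}
    # Hypothetical batch: each inner list holds the documents one query touched.
    batch = [[0, 1, 2], [2, 3], [1, 5]]
    for touched in batch:
        process_query(docs, touched)
    kept, retired = retire_unchanged(docs)
    print(sorted(kept), sorted(retired))
```

With a real update rule that can both raise and lower USENDX, a document
could return to INIT after being touched; the sketch treats such a document
as unchanged, which matches the "final USENDX ≠ INIT" wording of conclusion vii).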

6. Suggestions for Further Research 

As a result of the present work, the following questions, which
are still unanswered, might be well worth looking into:

i) It is not quite clear why the tried version of the TY algorithm,
which does not retire the documents with unchanged AVGWT
but does make retirement decisions after every batch of
queries, performs better at low recall than the other algorithms,
especially its close variant, the S1 algorithm. The main
suspected reason is that in the TY algorithm the retirement
decisions are made more frequently (the cost obviously
increases with this frequency), which keeps the document space
more up-to-date. This implies the need for determining the
optimum time span between successive retirement decisions,
with tradeoffs between cost and performance. It seems that
an S1 algorithm which makes retirement decisions after this
optimum time span should give the best overall performance.

ii) What should be the optimum frequency of reinitialization of the USENDX
of documents? How does it affect the cost and performance of
the system? This may have to be determined experimentally
in the practical environment being considered.

iii) How will random retirement of documents behave compared
to the other algorithms?

iv) How will the relative results of different algorithms vary
with the use of different collections?

In addition, it would be nice to see some research done to resolve the
following questions:

v) How can the ideas of obsolescence of literature based on
statistical analysis of the collections, as used by Brookes
[6], etc., be combined with document retirement based on
relevancy decisions? We feel that the analysis of this question
would require the use of collections larger than the ones
presently available for the SMART system.

vi) How will the retirement and/or the promotion of documents
between different levels of storage affect the cost of
transportation of documents, the cost of modifying the clusters
and/or the centroids, the optimum reorganization points for the
data base, the cost of modifying the thesaurus, the stability of the
document space, etc.? How will the retrieval performance,
when it also takes into account the system efficiency factor (not
considered in this report), i.e. the time required to search
through the hierarchical data base to find the desired
information, behave as a result of these?

vii) How much of the work done and ideas used in paging algorithms 
of operating systems may be useful in the area of document 
retirement? 

viii) Lastly, it might be nice to see all the ideas of document
retirement put together into a viable model and theory
of document retirement.
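
As a starting point for question iii), a random-retirement baseline could be
sketched as below. It is an assumption of this sketch, not something tried in
the present work, that "random retirement" means retiring a uniformly random
fraction of the collection; the function name, the `seed` parameter, and the
collection size in the usage lines are hypothetical.

```python
import random


def random_retirement(doc_ids, rate, seed=None):
    """Return (kept, retired) after retiring a uniformly random share of documents.

    rate is the fraction to retire, e.g. 0.10 for 10%.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must lie between 0 and 1")
    doc_ids = list(doc_ids)
    n_retire = round(rate * len(doc_ids))
    rng = random.Random(seed)  # private generator, so results are reproducible per seed
    retired = set(rng.sample(doc_ids, n_retire))
    kept = [d for d in doc_ids if d not in retired]
    return kept, sorted(retired)


if __name__ == "__main__":
    # Hypothetical collection of 1400 document ids, retired at a 10% rate.
    kept, retired = random_retirement(range(1400), 0.10, seed=1)
    print(len(kept), len(retired))
```

Running the retrieval evaluation on the kept documents at the same retirement
rates used for the other algorithms would give a curve against which their
benefit over chance could be judged.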